The router that proves its choice
Every task,
priced by outcome.
Ricardo routes every model call to the comparative-advantage choice and proves it on whether the work actually succeeded, not on an opinion. It runs on real agent traffic — Hermes, the open agent by Nous Research, doing code, web, vision and browser work.
How it works.
Your request comes in
“Refactor this module.” “Summarise the filing.” Any task, any modality — through one drop-in endpoint.
Ricardo routes it
To the model that does this task best per dollar — chosen on whether the work actually succeeded, not on a guess.
Same result, smaller bill
The outcome you’d pay frontier prices for — often at a fraction of the cost. With a per-call receipt you can audit.
See it choose.
One real task: “write is_valid_ipv4 — reject leading zeros and out-of-range octets.” We run each model’s actual code on the tricky case and watch the bill:
deepseek-v4-flash → ✓ passes $0.00012
claude-opus-4.8 → ✓ passes $0.00368 # same answer · ~127× / token
qwen-35b · local → ✓ passes ~$0 → all correct · route the cheap one
# WEB · agentic research · execution-graded
qwen-35b · local ✗ outcome 0.09 "couldn't gather usable info"
claude-opus-4.8 ✓ outcome 0.60 # +0.175 → escalate
Same models, opposite call. On code, cheap is already right — route the $0.00012 model, not the ~127×-pricier one (frontier is waste). On web, that same model fails (0.09) — so it escalates. The right call flips per task; a static “always cheap” or “always frontier” rule loses on half. We prove each call by re-running the work — receipt attached.
Same task. Two bills. Your choice.
Equivalent intelligence got ~700× cheaper since 2023; frontier intelligence got ~300× pricier. The space between is the largest cost–quality spread in business history — the routing decision every team is making, knowingly or not.
The gap is what Ricardo routes. measured here · one code task, all tests pass · deepseek-flash $0.00003 vs Opus $0.00426 — 142× the bill, identical result.
Why now.
Three forces, all pointing at routing — and all accelerating.
The widest spread in history
Equivalent capability fell ~700× since 2023 while frontier rose ~300× — the largest cost–quality gap commodity software has seen.
Cheap is now good enough
The best local model now wins or ties cloud 71% of the time, up from 23% in 2023. The good-enough-cheap zone widens every quarter.
The agents we route are surging
Nous Hermes tokens on OpenRouter grew 8.8B → 16.5B / month in three months (orint data) — the open agent our deployment routes is itself taking off.
We start where the R&D ends.
The hard, slow layer — capable open models, agents, training infrastructure — took Nous Research ~3 years and a $1B raise to mature. It’s done, and it’s open. We don’t rebuild it; we route it, by outcome. Years of foundation; months to value.
Routing has an opinion now.
Every router today sorts by price, latency, uptime. None of them has a view on whether the answer was any good. So nobody running an agent gets the best result for this task — they get a result from whichever model they happened to pick.
Ricardo is the opinion layer. Named for David Ricardo's comparative advantage: even if one model is absolutely best at everything, you still win by routing a task to the cheaper model that holds the advantage there. Specialisation beats a single frontier model on price and quality. The idea is two hundred years old; the application is new.
The catch every benchmark gets wrong:
A long agentic task can't be graded from its transcript. We measured it — the LLM judge scored a batch of web tasks 0.50; their real execution outcome was 0.09. The judge is polite. The work failed.
Hermes already knows what worked — retries, tool errors, whether the task completed. That ground-truth signal, which standalone routers can't reach, is the reward Ricardo learns from.
Evidence, not slideware.
| Model | Quality | Uplift | $ / call | Verdict |
|---|---|---|---|---|
| qwen 35b local | 0.43 | — | $0 | free default |
| claude-opus-4.8 | 0.60 | +0.175 | $0.043 | escalate ⬩ 5/8 |
| minimax-m2.5 | 0.45 | +0.03 | $0.001 | noise |
| glm-4.6 | 0.23 | −0.20 | $0.002 | dominated |
One experiment, $0.37, run against real captured traffic. It found that only a frontier model earns its escalation on web — and a single dial (λ, quality-points-per-dollar) turns the whole Pareto choice from cost-saving to quality-hungry.
BizAgentBench v1 — our own benchmark of business-agent tasks, every run graded by a fixed independent judge. Routing per task to the cheapest model that clears the quality bar scores 0.893 — above the best single model at 0.840 and +44% over the gpt-4.1-mini default at 0.618. Not just a cost story: per-task comparative advantage buys quality no single model has.
▸ Read the paper — Ricardo: Outcome-Grounded Routing (PDF)- 158real Hermes trajectories, captured in production
- Ground truthreward from execution — did it run, retry, complete
- Liveshadow router deployed, logging every call
- Reproducibleweb tasks replayed deterministically from recorded results — a benchmark from real usage
“The moat” is a rotating meme.
A rotating shell game — and the skeptics are right.
Each is a real component of an AI system; none, alone, is a moat. A router is copyable in a weekend (RouteLLM, Not Diamond, every gateway). If our pitch were “we built a router,” the right response is a shrug.
So it isn’t. The moat was never one layer — it’s the loop across all three, on data nobody has: evals (CodeSOTA) grade it, the router (Ricardo) acts on it, owned traffic (Hermes) feeds it. The shell game ends when you stop chasing the layer and start owning the compounding data between them.
- Not a slogana number you can run: same tests pass, ~127× the price per token
- Not opinionreward from execution — did the work succeed — not an LLM judge
- Not buyablethe corpus exists only from real owned traffic — no competitor can buy or clone it
The moat is the loop, not the model.
The mechanism — shadow A/B, score, promote-when-better — is industry-standard. The edge is three things no one else combines: it is multimodal, its judge is calibrated to human votes, and it runs on traffic we own — every task our deployment runs becomes a private (prompt → model → verified-outcome) datapoint nobody can scrape, synthesize, or buy.
Hermes works
Real tasks across code, web, vision, browser — generating real execution signal, all day.
Ricardo decides
Per task, per modality: free local by default, escalate to a frontier model only where the outcome is worth the cost.
Outcome feeds back
Did it succeed? That ground-truth reward sharpens the table — better routing, lower cost, a better agent, more traffic.
Three products. One loop.
Ricardo isn't a cold start. Three live products already produce exactly what routing-by-outcome needs — the market map, the verifiers, and the traffic — and each feeds the next.
We see the whole market
750 models · 577 days of OpenRouter. 64% of volume is cheap open models; the 36% Western frontier keeps the revenue. The 9.6× price gap — and where volume is migrating — is the routing opportunity, quantified.
ort.fabryka.ai →Ground truth, per task
Executable RL-verifier environments — audio, voice, code — plus a competitive sweep of 28 eval vendors. Which model actually wins which task, as runnable checks. The grading layer Ricardo routes on, already built.
codesota.com →Owned outcome signal
A live agent riding the open-model wave orint tracks — emitting real execution outcomes (ran, retried, completed). A private (prompt → model → outcome) flywheel no router can buy.
her.fabryka.ai →orint sees the menu & the spread → CodeSOTA grades it on ground truth → Hermes supplies live traffic & reward → Ricardo routes by comparative advantage → a cheaper, better Hermes → more traffic → orint sees the shift. The loop is the moat.
Drop-in for power users.
Hermes speaks the OpenRouter contract. Point your base URL at Ricardo and nothing changes — except a lower bill at equal quality, and a flywheel quietly learning which model to use for your work.
- One endpointOpenAI / OpenRouter-compatible, zero code change
- Per-call receiptsevery routed decision is logged and auditable
- Your λset how much quality is worth — cost-thrifty to frontier-hungry
export HERMES_BASE_URL=https://her.fabryka.ai/v1
# every task now routes to the proven-best
# model for its modality, and is recorded
hermes run "refactor this module and run the tests"
→ code · routed: qwen 35b · $0.000
hermes browse "find & summarise the Q3 filing"
→ web · routed: claude-opus · $0.041
✓ outcome recorded → flywheel
Go to market.
Land cheap, prove instantly, expand on usage. Product-led and dev-first — adoption is one line of config, because the value is a receipt, not a pitch. Low CAC by design.
Ride the Hermes wave
Hermes is the #1 open agent (180k ⭐); its users already overpay across providers. We’re the drop-in routing layer — riding its distribution instead of buying our own. Dogfood first, then open-source the client.
Every OpenRouter user
The endpoint is OpenRouter-compatible — the whole developer base switches in one line. Self-serve, usage-based. The savings receipt is the entire sales deck.
Teams with real bills
A team at $60k/mo on inference leaves ~$300k/yr on the table. Bottoms-up dev installs become org procurement; per-call receipts are the CFO artifact.
Channels · owned, ~zero CAC
- orintmarket intelligence — draws exactly the people who care about cost & routing
- CodeSOTAcredibility + a publishable benchmark (HermesBench) for inbound
- The papertechnical inbound, researchers, recruiting
- Open sourcebottoms-up dev distribution — Nous’s exact playbook
- Riding Hermesthe fastest-growing open agent’s growth is our growth
Access & the round.
For power users
Run Hermes with outcome-routing on your own workloads. Early access is opening now.
For clients
Better agent outcomes at lower cost, with per-call receipts you can audit. Pilots welcome.
For investors
A self-improving routing flywheel on owned multimodal traffic. We're raising. Let's talk.