RICARDO
Fabryka · her.fabryka.ai Routing Dispatch No. 01 Autonomous Agents

The router that proves its choice

Every task,
priced by outcome.

Ricardo routes every model call to the comparative-advantage choice and proves it on whether the work actually succeeded, not on an opinion. It runs on real agent traffic — Hermes, the open agent by Nous Research, doing code, web, vision and browser work.

Fig. 1 — Cost · Quality frontierBizAgentBench v1
0.80 0.70 0.60 judge score · same 50 tasks $0.0006 / task $ / task · log → gpt-4.1-mini · incumbent, dominated gpt-5-nano · 0.66 deepseek-flash · 0.71 deepseek-pro · 0.78 kimi-k2.6 · dominated
Measured · 50-task spread · judge-scored2026-06-10
0
real tasks captured
0
to find the better route
0
quality uplift, routed vs default
data flywheel, owned

How it works.

STEP 1

Your request comes in

“Refactor this module.” “Summarise the filing.” Any task, any modality — through one drop-in endpoint.

STEP 2

Ricardo routes it

To the model that does this task best per dollar — chosen on whether the work actually succeeded, not on a guess.

STEP 3

Same result, smaller bill

The outcome you’d pay frontier prices for — often at a fraction of the cost. With a per-call receipt you can audit.

See it choose.

One real task: “write is_valid_ipv4 — reject leading zeros and out-of-range octets.” We run each model’s actual code on the tricky case and watch the bill:

# CODE · is_valid_ipv4("01.2.3.4") → must be False · 18 hidden tests
deepseek-v4-flash → ✓ passes  $0.00012
claude-opus-4.8   → ✓ passes  $0.00368  # same answer · ~127× / token
qwen-35b · local  → ✓ passes  ~$0  → all correct · route the cheap one

# WEB · agentic research · execution-graded
qwen-35b · local  ✗ outcome 0.09  "couldn't gather usable info"
claude-opus-4.8   ✓ outcome 0.60  # +0.175 → escalate

Same models, opposite call. On code, cheap is already right — route the $0.00012 model, not the ~127×-pricier one (frontier is waste). On web, that same model fails (0.09) — so it escalates. The right call flips per task; a static “always cheap” or “always frontier” rule loses on half. We prove each call by re-running the work — receipt attached.

Same task. Two bills. Your choice.

Equivalent intelligence got ~700× cheaper since 2023; frontier intelligence got ~300× pricier. The space between is the largest cost–quality spread in business history — the routing decision every team is making, knowingly or not.

Fig. 2 — Indexed cost · same capability vs frontierlog scale · Jan 2023 = 100×
100,000× 10,000× 1,000× 100× 10× 0.1× 0.01× 2023 2024 2025 2026 indexed cost · log scale The Routing Opportunity Frontier intelligence ~300× pricier / task Equivalent intelligence ~700× cheaper / task

The gap is what Ricardo routes. measured here · one code task, all tests pass · deepseek-flash $0.00003 vs Opus $0.00426 — 142× the bill, identical result.

Why now.

Three forces, all pointing at routing — and all accelerating.

700×cheaper

The widest spread in history

Equivalent capability fell ~700× since 2023 while frontier rose ~300× — the largest cost–quality gap commodity software has seen.

71%win/tie vs cloud

Cheap is now good enough

The best local model now wins or ties cloud 71% of the time, up from 23% in 2023. The good-enough-cheap zone widens every quarter.

+87%Mar→May ’26

The agents we route are surging

Nous Hermes tokens on OpenRouter grew 8.8B → 16.5B / month in three months (orint data) — the open agent our deployment routes is itself taking off.

We start where the R&D ends.

The hard, slow layer — capable open models, agents, training infrastructure — took Nous Research ~3 years and a $1B raise to mature. It’s done, and it’s open. We don’t rebuild it; we route it, by outcome. Years of foundation; months to value.

The foundation · the open ecosystem · ~3 years of R&D
2023open fine-tunes
2024free 405B on OpenRouter
2025training infra + $1B
2026#1 open agent
Ricardo · applied from day one route → prove → save. Months, not years.
↑ we enter here — on the mature foundation, straight at the applied value
§01

Routing has an opinion now.

Every router today sorts by price, latency, uptime. None of them has a view on whether the answer was any good. So nobody running an agent gets the best result for this task — they get a result from whichever model they happened to pick.

Ricardo is the opinion layer. Named for David Ricardo's comparative advantage: even if one model is absolutely best at everything, you still win by routing a task to the cheaper model that holds the advantage there. Specialisation beats a single frontier model on price and quality. The idea is two hundred years old; the application is new.

The catch every benchmark gets wrong:

A long agentic task can't be graded from its transcript. We measured it — the LLM judge scored a batch of web tasks 0.50; their real execution outcome was 0.09. The judge is polite. The work failed.

Hermes already knows what worked — retries, tool errors, whether the task completed. That ground-truth signal, which standalone routers can't reach, is the reward Ricardo learns from.

§02

Evidence, not slideware.

Routing table — web tasks · incumbent = local qwen 35b (≈$0)
ModelQualityUplift$ / callVerdict
qwen 35b local0.43$0free default
claude-opus-4.80.60+0.175$0.043escalate ⬩ 5/8
minimax-m2.50.45+0.03$0.001noise
glm-4.60.23−0.20$0.002dominated

One experiment, $0.37, run against real captured traffic. It found that only a frontier model earns its escalation on web — and a single dial (λ, quality-points-per-dollar) turns the whole Pareto choice from cost-saving to quality-hungry.

Fig. 3 — Routing boundsBizAgentBench · n=14 common tasks
THE ROUTING GAP default · always gpt-4.1-mini 0.618 best single model · always deepseek-pro 0.840 RICARDO · cheapest model clearing the bar, per task 0.893 oracle · best per task, hindsight upper bound 0.939
Routing beats the best single model — comparative advantage, measured2026-06-10

BizAgentBench v1 — our own benchmark of business-agent tasks, every run graded by a fixed independent judge. Routing per task to the cheapest model that clears the quality bar scores 0.893 — above the best single model at 0.840 and +44% over the gpt-4.1-mini default at 0.618. Not just a cost story: per-task comparative advantage buys quality no single model has.

▸ Read the paper — Ricardo: Outcome-Grounded Routing (PDF)
  • 158real Hermes trajectories, captured in production
  • Ground truthreward from execution — did it run, retry, complete
  • Liveshadow router deployed, logging every call
  • Reproducibleweb tasks replayed deterministically from recorded results — a benchmark from real usage

“The moat” is a rotating meme.

2024finetuning is the moat
2025evals are the moat
2026model router is the moat
the truthnone alone is — the loop is

A rotating shell game — and the skeptics are right.

Each is a real component of an AI system; none, alone, is a moat. A router is copyable in a weekend (RouteLLM, Not Diamond, every gateway). If our pitch were “we built a router,” the right response is a shrug.

So it isn’t. The moat was never one layer — it’s the loop across all three, on data nobody has: evals (CodeSOTA) grade it, the router (Ricardo) acts on it, owned traffic (Hermes) feeds it. The shell game ends when you stop chasing the layer and start owning the compounding data between them.

  • Not a slogana number you can run: same tests pass, ~127× the price per token
  • Not opinionreward from execution — did the work succeed — not an LLM judge
  • Not buyablethe corpus exists only from real owned traffic — no competitor can buy or clone it
§03

The moat is the loop, not the model.

The mechanism — shadow A/B, score, promote-when-better — is industry-standard. The edge is three things no one else combines: it is multimodal, its judge is calibrated to human votes, and it runs on traffic we own — every task our deployment runs becomes a private (prompt → model → verified-outcome) datapoint nobody can scrape, synthesize, or buy.

01 · OWN THE AGENT

Hermes works

Real tasks across code, web, vision, browser — generating real execution signal, all day.

02 · OWN THE ROUTER

Ricardo decides

Per task, per modality: free local by default, escalate to a frontier model only where the outcome is worth the cost.

03 · OWN THE SIGNAL

Outcome feeds back

Did it succeed? That ground-truth reward sharpens the table — better routing, lower cost, a better agent, more traffic.

Three products. One loop.

Ricardo isn't a cold start. Three live products already produce exactly what routing-by-outcome needs — the market map, the verifiers, and the traffic — and each feeds the next.

ORINT · THE MARKET MAP
34Ttokens / week

We see the whole market

750 models · 577 days of OpenRouter. 64% of volume is cheap open models; the 36% Western frontier keeps the revenue. The 9.6× price gap — and where volume is migrating — is the routing opportunity, quantified.

ort.fabryka.ai →
CODESOTA · THE VERIFIERS
17benchmarks + RL envs

Ground truth, per task

Executable RL-verifier environments — audio, voice, code — plus a competitive sweep of 28 eval vendors. Which model actually wins which task, as runnable checks. The grading layer Ricardo routes on, already built.

codesota.com →
HERMES · THE TRAFFIC
158+real tasks & rising

Owned outcome signal

A live agent riding the open-model wave orint tracks — emitting real execution outcomes (ran, retried, completed). A private (prompt → model → outcome) flywheel no router can buy.

her.fabryka.ai →
orint market map CodeSOTA verifiers Hermes traffic + reward Ricardo

orint sees the menu & the spread CodeSOTA grades it on ground truth Hermes supplies live traffic & reward Ricardo routes by comparative advantage a cheaper, better Hermes more traffic orint sees the shift. The loop is the moat.

§04

Drop-in for power users.

Hermes speaks the OpenRouter contract. Point your base URL at Ricardo and nothing changes — except a lower bill at equal quality, and a flywheel quietly learning which model to use for your work.

  • One endpointOpenAI / OpenRouter-compatible, zero code change
  • Per-call receiptsevery routed decision is logged and auditable
  • Your λset how much quality is worth — cost-thrifty to frontier-hungry
# point Hermes at Ricardo — that's the whole integration
export HERMES_BASE_URL=https://her.fabryka.ai/v1

# every task now routes to the proven-best
# model for its modality, and is recorded
hermes run "refactor this module and run the tests"
→ code · routed: qwen 35b · $0.000
hermes browse "find & summarise the Q3 filing"
→ web · routed: claude-opus · $0.041
✓ outcome recorded → flywheel
§05

Go to market.

Land cheap, prove instantly, expand on usage. Product-led and dev-first — adoption is one line of config, because the value is a receipt, not a pitch. Low CAC by design.

PHASE 1 · LAND

Ride the Hermes wave

Hermes is the #1 open agent (180k ⭐); its users already overpay across providers. We’re the drop-in routing layer — riding its distribution instead of buying our own. Dogfood first, then open-source the client.

PHASE 2 · EXPAND

Every OpenRouter user

The endpoint is OpenRouter-compatible — the whole developer base switches in one line. Self-serve, usage-based. The savings receipt is the entire sales deck.

PHASE 3 · UP-MARKET

Teams with real bills

A team at $60k/mo on inference leaves ~$300k/yr on the table. Bottoms-up dev installs become org procurement; per-call receipts are the CFO artifact.

Channels · owned, ~zero CAC

  • orintmarket intelligence — draws exactly the people who care about cost & routing
  • CodeSOTAcredibility + a publishable benchmark (HermesBench) for inbound
  • The papertechnical inbound, researchers, recruiting
  • Open sourcebottoms-up dev distribution — Nous’s exact playbook
  • Riding Hermesthe fastest-growing open agent’s growth is our growth
Aligned pricing: we earn only when we save you money. Usage-based — a cut of the proven savings. Early traffic is subsidized to seed the data flywheel: CAC for the moat, which inverts as the routing table heats up.
§06

Access & the round.

For power users

Run Hermes with outcome-routing on your own workloads. Early access is opening now.

For clients

Better agent outcomes at lower cost, with per-call receipts you can audit. Pilots welcome.

For investors

A self-improving routing flywheel on owned multimodal traffic. We're raising. Let's talk.