The router that proves its choice

Every task,
priced by outcome.

Ricardo routes every model call to the comparative-advantage choice and proves it on whether the work actually succeeded, not on an opinion. It runs on real agent traffic — Hermes, the open agent by Nous Research, doing code, web, vision and browser work.

Request access See the evidence

Fig. 1 — Cost · Quality frontierBizAgentBench v1

Measured · 50-task spread · judge-scored2026-06-10

real tasks captured

to find the better route

quality uplift, routed vs default

∞

data flywheel, owned

▸

How it works.

STEP 1

Your request comes in

“Refactor this module.” “Summarise the filing.” Any task, any modality — through one drop-in endpoint.

→

STEP 2

Ricardo routes it

To the model that does this task best per dollar — chosen on whether the work actually succeeded, not on a guess.

→

STEP 3

Same result, smaller bill

The outcome you’d pay frontier prices for — often at a fraction of the cost. With a per-call receipt you can audit.

▸

See it choose.

One real task: “write is_valid_ipv4 — reject leading zeros and out-of-range octets.” We run each model’s actual code on the tricky case and watch the bill:

# CODE · is_valid_ipv4("01.2.3.4") → must be False · 18 hidden tests

deepseek-v4-flash → ✓ passes  $0.00012

claude-opus-4.8   → ✓ passes  $0.00368  # same answer · ~127× / token

qwen-35b · local  → ✓ passes  ~$0  → all correct · route the cheap one

# WEB · agentic research · execution-graded

qwen-35b · local  ✗ outcome 0.09  "couldn't gather usable info"

claude-opus-4.8   ✓ outcome 0.60  # +0.175 → escalate

Same models, opposite call. On code, cheap is already right — route the $0.00012 model, not the ~127×-pricier one (frontier is waste). On web, that same model fails (0.09) — so it escalates. The right call flips per task; a static “always cheap” or “always frontier” rule loses on half. We prove each call by re-running the work — receipt attached.

✶

Same task. Two bills. Your choice.

Equivalent intelligence got ~700× cheaper since 2023; frontier intelligence got ~300× pricier. The space between is the largest cost–quality spread in business history — the routing decision every team is making, knowingly or not.

Fig. 2 — Indexed cost · same capability vs frontierlog scale · Jan 2023 = 100×

The gap is what Ricardo routes. measured here · one code task, all tests pass · deepseek-flash $0.00003 vs Opus $0.00426 — 142× the bill, identical result.

✦

Why now.

Three forces, all pointing at routing — and all accelerating.

700×cheaper

The widest spread in history

Equivalent capability fell ~700× since 2023 while frontier rose ~300× — the largest cost–quality gap commodity software has seen.

71%win/tie vs cloud

Cheap is now good enough

The best local model now wins or ties cloud 71% of the time, up from 23% in 2023. The good-enough-cheap zone widens every quarter.

+87%Mar→May ’26

The agents we route are surging

Nous Hermes tokens on OpenRouter grew 8.8B → 16.5B / month in three months (orint data) — the open agent our deployment routes is itself taking off.

✦

We start where the R&D ends.

The hard, slow layer — capable open models, agents, training infrastructure — took Nous Research ~3 years and a $1B raise to mature. It’s done, and it’s open. We don’t rebuild it; we route it, by outcome. Years of foundation; months to value.

The foundation · the open ecosystem · ~3 years of R&D

2023open fine-tunes

→

2024free 405B on OpenRouter

→

2025training infra + $1B

→

2026#1 open agent

Ricardo · applied from day one route → prove → save. Months, not years.

↑ we enter here — on the mature foundation, straight at the applied value

§01

Routing has an opinion now.

Every router today sorts by price, latency, uptime. None of them has a view on whether the answer was any good. So nobody running an agent gets the best result for this task — they get a result from whichever model they happened to pick.

Ricardo is the opinion layer. Named for David Ricardo's comparative advantage: even if one model is absolutely best at everything, you still win by routing a task to the cheaper model that holds the advantage there. Specialisation beats a single frontier model on price and quality. The idea is two hundred years old; the application is new.

The catch every benchmark gets wrong:

A long agentic task can't be graded from its transcript. We measured it — the LLM judge scored a batch of web tasks 0.50; their real execution outcome was 0.09. The judge is polite. The work failed.

Hermes already knows what worked — retries, tool errors, whether the task completed. That ground-truth signal, which standalone routers can't reach, is the reward Ricardo learns from.

§02

Evidence, not slideware.

Routing table — web tasks · incumbent = local qwen 35b (≈$0)
Model	Quality	Uplift	$ / call	Verdict
qwen 35b local	0.43	—	$0	free default
claude-opus-4.8	0.60	+0.175	$0.043	escalate ⬩ 5/8
minimax-m2.5	0.45	+0.03	$0.001	noise
glm-4.6	0.23	−0.20	$0.002	dominated

One experiment, $0.37, run against real captured traffic. It found that only a frontier model earns its escalation on web — and a single dial (λ, quality-points-per-dollar) turns the whole Pareto choice from cost-saving to quality-hungry.

Fig. 3 — Routing boundsBizAgentBench · n=14 common tasks

Routing beats the best single model — comparative advantage, measured2026-06-10

BizAgentBench v1 — our own benchmark of business-agent tasks, every run graded by a fixed independent judge. Routing per task to the cheapest model that clears the quality bar scores 0.893 — above the best single model at 0.840 and +44% over the gpt-4.1-mini default at 0.618. Not just a cost story: per-task comparative advantage buys quality no single model has.

▸ Read the paper — Ricardo: Outcome-Grounded Routing (PDF)

158real Hermes trajectories, captured in production
Ground truthreward from execution — did it run, retry, complete
Liveshadow router deployed, logging every call
Reproducibleweb tasks replayed deterministically from recorded results — a benchmark from real usage

◆

“The moat” is a rotating meme.

2024finetuning is the moat

→

2025evals are the moat

→

2026model router is the moat

→

the truthnone alone is — the loop is

A rotating shell game — and the skeptics are right.

Each is a real component of an AI system; none, alone, is a moat. A router is copyable in a weekend (RouteLLM, Not Diamond, every gateway). If our pitch were “we built a router,” the right response is a shrug.

So it isn’t. The moat was never one layer — it’s the loop across all three, on data nobody has: evals (CodeSOTA) grade it, the router (Ricardo) acts on it, owned traffic (Hermes) feeds it. The shell game ends when you stop chasing the layer and start owning the compounding data between them.

Not a slogana number you can run: same tests pass, ~127× the price per token
Not opinionreward from execution — did the work succeed — not an LLM judge
Not buyablethe corpus exists only from real owned traffic — no competitor can buy or clone it

§03

The moat is the loop, not the model.

The mechanism — shadow A/B, score, promote-when-better — is industry-standard. The edge is three things no one else combines: it is multimodal, its judge is calibrated to human votes, and it runs on traffic we own — every task our deployment runs becomes a private (prompt → model → verified-outcome) datapoint nobody can scrape, synthesize, or buy.

01 · OWN THE AGENT

Hermes works

Real tasks across code, web, vision, browser — generating real execution signal, all day.

→

02 · OWN THE ROUTER

Ricardo decides

Per task, per modality: free local by default, escalate to a frontier model only where the outcome is worth the cost.

→

03 · OWN THE SIGNAL

Outcome feeds back

Did it succeed? That ground-truth reward sharpens the table — better routing, lower cost, a better agent, more traffic.

◆

Three products. One loop.

Ricardo isn't a cold start. Three live products already produce exactly what routing-by-outcome needs — the market map, the verifiers, and the traffic — and each feeds the next.

ORINT · THE MARKET MAP

34Ttokens / week

We see the whole market

750 models · 577 days of OpenRouter. 64% of volume is cheap open models; the 36% Western frontier keeps the revenue. The 9.6× price gap — and where volume is migrating — is the routing opportunity, quantified.

ort.fabryka.ai →

→

CODESOTA · THE VERIFIERS

17benchmarks + RL envs

Ground truth, per task

Executable RL-verifier environments — audio, voice, code — plus a competitive sweep of 28 eval vendors. Which model actually wins which task, as runnable checks. The grading layer Ricardo routes on, already built.

codesota.com →

→

HERMES · THE TRAFFIC

158+real tasks & rising

Owned outcome signal

A live agent riding the open-model wave orint tracks — emitting real execution outcomes (ran, retried, completed). A private (prompt → model → outcome) flywheel no router can buy.

her.fabryka.ai →

orint sees the menu & the spread → CodeSOTA grades it on ground truth → Hermes supplies live traffic & reward → Ricardo routes by comparative advantage → a cheaper, better Hermes → more traffic → orint sees the shift. The loop is the moat.

§04

Drop-in for power users.

Hermes speaks the OpenRouter contract. Point your base URL at Ricardo and nothing changes — except a lower bill at equal quality, and a flywheel quietly learning which model to use for your work.

One endpointOpenAI / OpenRouter-compatible, zero code change
Per-call receiptsevery routed decision is logged and auditable
Your λset how much quality is worth — cost-thrifty to frontier-hungry

# point Hermes at Ricardo — that's the whole integration

export HERMES_BASE_URL=https://her.fabryka.ai/v1

# every task now routes to the proven-best

# model for its modality, and is recorded

hermes run "refactor this module and run the tests"

  → code   · routed: qwen 35b      · $0.000

hermes browse "find & summarise the Q3 filing"

  → web    · routed: claude-opus   · $0.041

  ✓ outcome recorded → flywheel

§05

Go to market.

Land cheap, prove instantly, expand on usage. Product-led and dev-first — adoption is one line of config, because the value is a receipt, not a pitch. Low CAC by design.

PHASE 1 · LAND

Ride the Hermes wave

Hermes is the #1 open agent (180k ⭐); its users already overpay across providers. We’re the drop-in routing layer — riding its distribution instead of buying our own. Dogfood first, then open-source the client.

PHASE 2 · EXPAND

Every OpenRouter user

The endpoint is OpenRouter-compatible — the whole developer base switches in one line. Self-serve, usage-based. The savings receipt is the entire sales deck.

PHASE 3 · UP-MARKET

Teams with real bills

A team at $60k/mo on inference leaves ~$300k/yr on the table. Bottoms-up dev installs become org procurement; per-call receipts are the CFO artifact.

Channels · owned, ~zero CAC

orintmarket intelligence — draws exactly the people who care about cost & routing
CodeSOTAcredibility + a publishable benchmark (HermesBench) for inbound
The papertechnical inbound, researchers, recruiting
Open sourcebottoms-up dev distribution — Nous’s exact playbook
Riding Hermesthe fastest-growing open agent’s growth is our growth

Aligned pricing: we earn only when we save you money. Usage-based — a cut of the proven savings. Early traffic is subsidized to seed the data flywheel: CAC for the moat, which inverts as the routing table heats up.

§06

Access & the round.

For power users

Run Hermes with outcome-routing on your own workloads. Early access is opening now.

For clients

Better agent outcomes at lower cost, with per-call receipts you can audit. Pilots welcome.

For investors

A self-improving routing flywheel on owned multimodal traffic. We're raising. Let's talk.

Request access Talk to us about the round

Every task,priced by outcome.