The bill arrives at the end of the month
You ship a bot. Claude responds well, the client is happy. The first month goes by quietly. Then you open Anthropic billing: $200+ for traffic from a small café.
You dig into the logs. 60,000 requests over a month. "Are you open on Sundays?", "What's your address?", "Is delivery free?" — thousands of times. Every single one routed through Claude Sonnet with a 400-token system prompt.
This isn't a model cost problem. It's an architecture problem: a uniform model serving fundamentally non-uniform load.
Request complexity in a business bot isn't normally distributed; it's bimodal. A broad mass of FAQ requests where Claude's power is completely wasted, and a narrow spike of complaints, edge cases, and generation tasks where it's actually needed. If you don't split these flows, you're paying for cloud inference where a local model would have been fine.
Why "just use Ollama" doesn't work
The obvious fix: move everything to Ollama. Models like llama3.1:8b or mistral:7b on a GPU give acceptable quality for simple tasks at zero variable cost.
The problem is that open-source models degrade in specific scenarios: long context (>3K tokens), strict output format requirements, multi-step reasoning. In a bot with RAG, these come up regularly. Moving everything to Ollama means unpredictable quality exactly where the client will notice.
The other take — "only pay Claude for complex requests" — is directionally right, but what counts as "complex"? Without a formal classifier, this turns into manually maintained conditionals in code that don't scale and break with every traffic shift.
You need a router: a component that decides which model handles the request before it goes anywhere.
Architecture: one interface, two tiers
The core requirement: the router must be invisible from the outside. From the FastAPI endpoint's perspective, there's a single llm_client.complete() that always returns a response. Where the request went is an implementation detail.
There's no load balancing between Ollama and Claude — there's a hierarchy. Ollama is the first tier, Claude is escalation. Escalation happens in three cases: the router decided so, Ollama returned an invalid response, or Ollama is unavailable.
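A minimal sketch of that contract, assuming placeholder local and cloud wrappers that both expose the same complete() call (none of these names come from the original code):

from enum import Enum, auto

class ModelTarget(Enum):
    LOCAL = auto()   # Ollama
    CLOUD = auto()   # Claude

class LLMClient:
    """Single entry point; callers never learn which model actually answered."""

    def __init__(self, router, local, cloud):
        self.router = router   # decides LOCAL vs CLOUD before any model is called
        self.local = local     # Ollama wrapper
        self.cloud = cloud     # Claude (Anthropic API) wrapper

    async def complete(self, request) -> str:
        if self.router.route(request) is ModelTarget.LOCAL:
            try:
                reply = await self.local.complete(request)
                if reply is not None:   # valid, parseable response
                    return reply
            except ConnectionError:
                pass                    # Ollama unavailable: fall through
        # escalation: the router chose CLOUD, the local reply was invalid, or local is down
        return await self.cloud.complete(request)

From the FastAPI endpoint's side the call stays await llm_client.complete(request); all three escalation cases collapse into the same fall-through.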
The router: asymmetry of error cost
The router isn't a binary "simple/complex" classifier. The correct framing: minimize the expected cost of a routing error.
Error toward Ollama for a complex request: quality degradation, retry, potentially a broken conversation. In B2B — real business consequences for the client.
Error toward Claude for a simple request: a few cents of overspending.
The asymmetry is obvious. It produces a concrete rule: when in doubt, go to the cloud. This isn't being conservative — it's correctly accounting for the real cost of each error type.
Decision logic is two-layered.
Hard rules fire first and override any scoring. Complaints, legal context, generation tasks — always Claude. A clean request for opening hours or an address — always Ollama.
Soft scoring kicks in when hard rules don't fire. Factors: RAG context volume, format requirements, message length, the number of consecutive clarifying questions in the dialog (a rising count signals that previous answers weren't solving the problem).
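A sketch of both layers under those rules; the marker lists, the request fields (rag_context, format_required, clarifying_streak), and the weights are illustrative placeholders, and ModelTarget is the enum from the earlier sketch:

from dataclasses import dataclass

@dataclass
class RoutingSignal:
    score: float        # 0 = trivially simple, 1 = clearly complex
    confidence: float   # how certain the classifier is about that score

COMPLAINT_MARKERS = ("refund", "complaint", "lawyer")            # hypothetical list
FAQ_MARKERS = ("open", "hours", "address", "delivery", "price")  # hypothetical list

def hard_rules(request) -> ModelTarget | None:
    text = request.text.lower()
    if any(m in text for m in COMPLAINT_MARKERS) or request.needs_generation:
        return ModelTarget.CLOUD   # complaints, legal context, generation: always Claude
    if any(m in text for m in FAQ_MARKERS) and not request.rag_context:
        return ModelTarget.LOCAL   # clean FAQ request: always Ollama
    return None                    # no hard rule fired, fall through to scoring

def soft_score(request, dialog) -> RoutingSignal:
    score = 0.0
    if len(request.rag_context) > 2000:   # heavy RAG context (characters as a token proxy)
        score += 0.3
    if request.format_required:           # strict output format (JSON, table, ...)
        score += 0.2
    if len(request.text) > 500:           # long, involved message
        score += 0.2
    if dialog.clarifying_streak >= 2:     # previous answers aren't landing
        score += 0.3
    # toy confidence heuristic: less certain near the decision boundary;
    # a real classifier would report its own confidence
    confidence = min(1.0, 0.5 + abs(score - 0.35))
    return RoutingSignal(score=score, confidence=confidence)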
The routing threshold is deliberately shifted:
# signal is the soft-scoring output: the score and the classifier's confidence in it
target = (
    ModelTarget.CLOUD
    if signal.score > 0.35 or signal.confidence < 0.6
    else ModelTarget.LOCAL
)
confidence < 0.6 — if the router isn't confident enough in its classification, the request goes to Claude. Explicit codification of the asymmetry.
Three things that break in production
Ollama's formatted output. Even with an explicit instruction to return JSON, llama3.1:8b periodically wraps it in a markdown code block or adds surrounding text. In production this isn't an edge case — it's a regular scenario. Solution: parsing with multiple fallback patterns, and after two failed attempts — automatic escalation to Claude. Not three retries, not four: a second retry on Ollama is slower than a single Claude call.
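A sketch of that parsing chain, assuming the local model was asked for a single JSON object; the regex fallbacks and the two-attempt cap follow the rule above:

import json
import re

def parse_llm_json(raw: str) -> dict | None:
    """Try increasingly forgiving ways to pull a JSON object out of model output."""
    candidates = [raw.strip()]
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", raw, re.DOTALL)   # markdown code block
    if fenced:
        candidates.append(fenced.group(1))
    braced = re.search(r"\{.*\}", raw, re.DOTALL)   # first {...} inside surrounding chatter
    if braced:
        candidates.append(braced.group(0))
    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None

async def local_with_escalation(local, cloud, request):
    for _ in range(2):                       # at most two attempts on Ollama
        parsed = parse_llm_json(await local.complete(request))
        if parsed is not None:
            return parsed
    return await cloud.complete(request)     # second failure: escalate instead of retrying again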
Context window under load. Ollama allocates num_ctx on the first request to a model and doesn't adjust it dynamically within a session. If the service started with the default num_ctx=2048 and a request arrives with 3,500 tokens of RAG context — the context gets silently truncated. No error, just a response about nothing. num_ctx must be passed explicitly on every request, with headroom above the actual volume.
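Passing it per request through Ollama's /api/chat endpoint might look like this; the token estimate, the headroom, and the 4096 floor are assumptions:

import httpx

async def ollama_chat(messages, rag_context_tokens: int, model: str = "llama3.1:8b") -> str:
    # size the window for what this request actually carries, plus headroom for the reply
    needed = rag_context_tokens + 1024
    async with httpx.AsyncClient(timeout=60) as client:
        resp = await client.post(
            "http://localhost:11434/api/chat",
            json={
                "model": model,
                "messages": messages,
                "stream": False,
                # without this, Ollama keeps whatever num_ctx the model was first loaded with
                "options": {"num_ctx": max(4096, needed)},
            },
        )
        resp.raise_for_status()
        return resp.json()["message"]["content"]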
Latency degradation during spikes. On a single GPU, Ollama doesn't parallelize requests — it queues them. During sudden traffic spikes, p95 latency grows linearly, the router doesn't know this, and keeps routing locally. You need a circuit breaker on latency, not just errors: when the p95 threshold is exceeded, all traffic temporarily goes to Claude regardless of classification. This needs to be a separate component — don't add the condition into the router logic, or the breaker's state gets tangled up with classification.
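A minimal sketch of such a breaker as its own component; window size, p95 threshold, and cool-down are placeholders:

import time
from collections import deque

class LatencyBreaker:
    """Trips when local p95 latency exceeds a limit; recovers after a cool-down."""

    def __init__(self, p95_limit_s: float = 8.0, window: int = 50, cooldown_s: float = 120.0):
        self.p95_limit_s = p95_limit_s
        self.samples = deque(maxlen=window)   # recent Ollama latencies, seconds
        self.cooldown_s = cooldown_s
        self.tripped_at: float | None = None

    def record(self, latency_s: float) -> None:
        self.samples.append(latency_s)
        if len(self.samples) >= 20 and self._p95() > self.p95_limit_s:
            self.tripped_at = time.monotonic()

    def allows_local(self) -> bool:
        if self.tripped_at is None:
            return True
        if time.monotonic() - self.tripped_at > self.cooldown_s:
            self.tripped_at = None            # cool-down elapsed, try local again
            self.samples.clear()
            return True
        return False                          # tripped: everything goes to Claude

    def _p95(self) -> float:
        ordered = sorted(self.samples)
        return ordered[int(0.95 * (len(ordered) - 1))]

The client records latency after every Ollama call and checks allows_local() before honouring a LOCAL decision; the router itself never sees the breaker's state.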
Observability
Without proper logging, the system is opaque: you see costs but don't understand what's driving them.
The key is logging not just routed_to, but also actual_model. These fields diverge during escalation. Escalation frequency is the primary health metric for the router: if it's growing, either the traffic pattern changed, the local model degraded, or the thresholds need recalibration.
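One way to write that record, assuming structured JSON logs; field names beyond routed_to and actual_model are illustrative:

import json
import logging

logger = logging.getLogger("llm_router")
LOCAL_MODEL = "llama3.1:8b"

def log_routing(request_id: str, routed_to: str, actual_model: str, latency_s: float) -> None:
    # routed_to is the router's decision, actual_model is who really answered;
    # they diverge exactly when an escalation happened
    logger.info(json.dumps({
        "request_id": request_id,
        "routed_to": routed_to,          # "local" / "cloud"
        "actual_model": actual_model,    # e.g. "llama3.1:8b" or "claude-sonnet"
        "escalated": routed_to == "local" and actual_model != LOCAL_MODEL,
        "latency_s": round(latency_s, 2),
    }))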
The second important signal is a proxy quality metric. Not manual response labeling — downstream behavior: if a user asks a follow-up question within two minutes of a response, the first answer probably didn't solve the problem. Measurable with zero additional infrastructure.
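A sketch of that proxy over the message history; the two-minute window is from the text, the record fields (role, user_id, sent_at) are assumptions:

from datetime import timedelta

FOLLOW_UP_WINDOW = timedelta(minutes=2)

def quick_follow_up_rate(messages) -> float:
    """Share of bot answers followed by another user message within two minutes.

    messages: chronologically sorted records with .role ("user"/"bot"),
    .user_id and .sent_at (datetime); all assumed fields.
    """
    answers = 0
    quick_follow_ups = 0
    for prev, nxt in zip(messages, messages[1:]):
        if prev.role != "bot" or nxt.role != "user" or prev.user_id != nxt.user_id:
            continue
        answers += 1
        if nxt.sent_at - prev.sent_at <= FOLLOW_UP_WINDOW:
            quick_follow_ups += 1
    return quick_follow_ups / answers if answers else 0.0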
The numbers
Real case: a Telegram bot for a café, one month of observation after rolling out the router.
Request type                     | Traffic share | Model
FAQ, hours, address, prices      | 61%           | Ollama
Menu clarifications, ingredients | 18%           | Ollama
Edge cases, complaints           | 12%           | Claude
RAG over documents, generation   | 9%            | Claude
Cost before: $234/month. After: $47/month. Quality, measured by client complaints, stayed unchanged: the scenarios that used to go to Claude still go to Claude.
The 80% cost reduction isn't the goal of the architecture. It's a side effect of making request cost a function of complexity rather than a constant. The real gain: the system became legible. Now you can see what each interaction type costs and know exactly what to do about it when traffic grows.


