Originally published at theculprit.ai/blog/anthropic-prompt-caching-90-percent.
LLM costs in production scale faster than the demo bill suggests they will.
The shape of the problem: you ship a feature that calls Claude on every meaningful event. The first month the bill is rounding error and nobody looks at it. The second month a customer's traffic ramps and the line item is suddenly five percent of revenue. The third month your finance person sends a polite Slack about whether this is "a real cost trend or a one-time spike," and everyone on the engineering team has to defend an architecture decision they made eight weeks ago when the bill was rounding error.
You can reduce this. Not by being clever about how you call the model — by being clever about what's constant across your calls. Anthropic's prompt caching, in our case, takes the per-RCA input cost from full-rate to one-tenth of full-rate on a 90%+ cache-hit rate. That's not a hypothetical; it's what we measure in production, and the math is simple enough to walk through here so you can run the numbers on your own pipeline.
The pricing structure
Anthropic publishes four price points per model. For Claude Haiku 4.5, the model we run as the default for incident root-cause analysis, those points are (verified from the Anthropic API docs):
| Token category | Haiku 4.5 |
|---|---|
| Base input | $1.00 per million tokens |
| Cache write (5-minute TTL) | $1.25 per million tokens |
| Cache read | $0.10 per million tokens |
| Output | $5.00 per million tokens |
Two things to read from that table:
- Cache read is 10x cheaper than base input. Same tokens in the request body, ten percent of the cost — if you can get them into the cache.
- Cache write is 25% more expensive than base input. The first time you send a cached segment, you're paying a small premium so the next request can pay the discount. The math only pays off if you use the same cached segment more than ~1.3 times on average within the 5-minute TTL window (the exact break-even works out to about 1.28 uses).
That second point is the one most teams miss. If your call pattern is "one-shot, cold cache every time," prompt caching makes you slightly worse off. The win comes from repeatable structure across calls.
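To make the break-even concrete, here it is as a small sketch. The constants are just the Haiku 4.5 list prices from the table above; nothing here is production code.

```typescript
// Break-even for the 5-minute cache on Haiku 4.5 list prices ($ per million tokens).
const BASE = 1.0;         // uncached input
const CACHE_WRITE = 1.25; // first send of a cached segment
const CACHE_READ = 0.1;   // every reuse within the TTL

// Cost of one cached segment used `uses` times inside a TTL window, vs. not caching it.
const cachedCost = (uses: number) => CACHE_WRITE + CACHE_READ * (uses - 1);
const uncachedCost = (uses: number) => BASE * uses;

// cachedCost(n) < uncachedCost(n) when n > (1.25 - 0.10) / (1.00 - 0.10) ≈ 1.28
console.log(cachedCost(1), uncachedCost(1)); // 1.25 vs 1.00: one-shot calls lose
console.log(cachedCost(2), uncachedCost(2)); // 1.35 vs 2.00: two uses already win
```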
What's actually cacheable in an RCA call
A typical RCA call has five sources of tokens:
- System prompt. Defines the role ("you are an SRE analyzing an incident"), the JSON schema for the response, and any guardrails. Identical across every call across every tenant. Maybe 800-1500 tokens depending on how rigorous your schema is.
- Retrieval context ("here are 3 prior incidents from this same service that resolved similarly"). Static for a few minutes within a Batch run on one tenant + service. Maybe 400-800 tokens depending on how aggressive the retrieval is.
- Per-incident events ("event 1 at 14:32:01: ConnectionPoolExhausted...; event 2 at 14:32:04: ..."). Unique to the incident under analysis. Cannot be cached across incidents. Typically 1500-3000 tokens.
- Per-incident metadata (incident ID, service ID, severity). Tiny but unique.
- Output tokens. The model's response. Cost is fixed at the output rate; caching doesn't apply.
Sources 1 and 2 are cacheable. Sources 3 and 4 are not. Source 5 is irrelevant.
In our distribution, sources 1 + 2 are roughly 70-80% of the input tokens for a typical RCA call. Cache them at $0.10 per million; pay full rate on the remaining 20-30%; total input cost drops by about 60-70% from the naive baseline. The "90%" headline number rounds up because we measure cache hits, not total cost, and within the cached portion the savings really are 90%.
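As a back-of-the-envelope check on that range (a sketch assuming a 75% cacheable share; swap in your own split):

```typescript
// Blended input rate when ~75% of input tokens are served as cache reads.
// Cache writes amortize to noise at a 90%+ hit rate, so they're ignored here.
const cachedShare = 0.75;
const blendedRate = cachedShare * 0.10 + (1 - cachedShare) * 1.00; // $ per million input tokens
console.log(blendedRate); // 0.325, roughly a two-thirds cut from the $1.00 base rate
```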
The two-segment trick
Anthropic's API takes a cache_control marker per segment in your system array. Each marker is a cache breakpoint: the cache stores the prefix of everything up to that marker. With two segments you get two breakpoints, one after the first segment and one after both, so each prefix can hit or miss independently:
```typescript
// Conceptual shape — see rca-prompt.ts for the exact code we run.
const system = [
  {
    type: 'text',
    text: SYSTEM_PROMPT, // ~1200 tokens, identical everywhere
    cache_control: { type: 'ephemeral' },
  },
  {
    type: 'text',
    text: priorIncidentsContext, // ~600 tokens, per-tenant per-service
    cache_control: { type: 'ephemeral' },
  },
];
```
Why two segments instead of one? Because the cache lifetime for those two pieces is different.
The system prompt almost never changes — every RCA call across every tenant hits the cache, so it's a cache read essentially every time after the first call.
The retrieval context (prior similar incidents for this service) changes whenever a new incident on that service resolves and shifts the top-K. Within a single Batch run on one tenant + service, repeats hit the cache. Across tenants, never.
If you stuff both into a single segment, the system prompt can never be shared across tenants: each tenant's combined segment hashes differently, and any change to a tenant's retrieval context forces a fresh cache write of the whole block, system prompt included. Two segments → independent breakpoints → the system-prompt prefix stays hot for everyone, and retrieval-context churn only invalidates the second breakpoint.
The order matters. Anthropic caches up to each marker, so the more-static segment must come first. If you put per-tenant retrieval first and the static system prompt second, the static prompt's cache key now includes the per-tenant content above it; you've just made the most cacheable segment uncacheable across tenants.
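Putting the ordering together, a minimal end-to-end sketch looks like this. The placeholder strings and the model id are illustrative, not lifted from rca-prompt.ts; the call shape is the standard @anthropic-ai/sdk messages API with cache_control on the system segments.

```typescript
import Anthropic from '@anthropic-ai/sdk';

// Placeholder content, standing in for the real prompt assembly.
const SYSTEM_PROMPT =
  'You are an SRE analyzing an incident. Respond with JSON matching the schema...';
const priorIncidentsContext = 'Prior similar incidents for this service: ...';
const incidentEvents = 'event 1 at 14:32:01: ConnectionPoolExhausted; event 2 at 14:32:04: ...';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const response = await anthropic.messages.create({
  model: 'claude-haiku-4-5', // illustrative model id
  max_tokens: 1024,
  system: [
    // Most-static segment first, so its cached prefix survives per-tenant churn below it.
    { type: 'text', text: SYSTEM_PROMPT, cache_control: { type: 'ephemeral' } },
    { type: 'text', text: priorIncidentsContext, cache_control: { type: 'ephemeral' } },
  ],
  messages: [
    // Truly per-call content stays out of the cached prefix entirely.
    { role: 'user', content: incidentEvents },
  ],
});

// usage tells you whether you hit or wrote: cache_read_input_tokens vs cache_creation_input_tokens.
console.log(response.usage);
```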
What kills the cache
In rough order of frequency:
The 5-minute ephemeral TTL. A cached segment expires 5 minutes after it was last used; reads refresh the TTL, so steady traffic keeps the cache warm indefinitely. If your call pattern is bursty (RCA calls cluster around incidents, then quiet for an hour), a long quiet period will let every cached segment expire and you'll pay cache write (slightly above base rate) on the next batch. Spread your calls if you can; if you can't, accept that the first few calls after a quiet period pay full freight.
Whitespace drift. If you concatenate the system prompt with, say, "\n\n" as the separator in one place and "\n" in another, you have two distinct cache keys. The cache hashes the literal token sequence, not the semantic meaning. Pick one separator and lint for it.
Trailing dynamic content. A common bug: someone adds a timestamp to the "system prompt" — Today's date is 2026-05-08T14:32:01Z — for "context". The timestamp changes every call. Now nothing cached after the timestamp survives. Keep dynamic content out of cached segments entirely; pass it as a user-message turn instead.
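A cheap guard against both of those failure modes is to assemble the cached segment through one deterministic path and route anything time-varying into the user turn. A sketch, with made-up helper names:

```typescript
// One separator constant, used everywhere the cached segment is assembled.
const SEP = '\n\n';

// Hypothetical helper: builds the cacheable prefix deterministically.
// No timestamps, request ids, or anything else that changes per call.
function buildCachedSegment(parts: string[]): string {
  return parts.map((p) => p.trim()).join(SEP);
}

// Anything dynamic rides in the user turn, where it can't invalidate the cached prefix.
function buildUserTurn(incidentEvents: string, now: Date = new Date()): string {
  return `Current time: ${now.toISOString()}${SEP}${incidentEvents}`;
}
```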
Schema version churn. If you're iterating on your JSON output schema (a normal early-product activity), every schema edit invalidates every cached system prompt. The cost of "tuning the schema" is partly paid in cache misses. Plan for one or two big schema-stabilization sweeps rather than continuous tweaks.
The production numbers
Per-RCA cost on Haiku 4.5 with prompt caching enabled, Batch API (which itself adds another 50% off both input and output), 4000 input tokens + 500 output tokens, ~75% of input tokens cached:
- Input (cached portion, 3000 tokens × 0.5 batch discount × $0.10/MTok cache read): $0.00015
- Input (uncached portion, 1000 tokens × 0.5 batch discount × $1.00/MTok base): $0.00050
- Output (500 tokens × 0.5 batch discount × $5.00/MTok): $0.00125
- Cache write, amortized (1200 tokens × 0.5 batch discount × $1.25/MTok, divided across ~30 cache hits per write cycle): ~$0.00003
Total: ~$0.0019 per RCA call.
Without caching, same call shape, real-time API: input would be ~$0.004, output would be ~$0.0025, total ~$0.0065. Caching alone gets us a ~65-70% reduction on input. Batch API gets us another 50% off everything on top. Caching + Batch is what makes the per-RCA cost sit around a fifth of a cent.
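If you want to run the same arithmetic on your own call shape, it fits in a few lines. The list prices come from the table above; the 50% batch discount, token counts, and ~30 hits per cache write are the assumptions stated in this section.

```typescript
// Per-call cost model for Haiku 4.5 with prompt caching + Batch API.
// All prices in dollars per million tokens; the batch discount applies to everything.
const PRICE = { base: 1.0, cacheWrite: 1.25, cacheRead: 0.1, output: 5.0 };
const BATCH = 0.5;
const MTOK = 1_000_000;

function perCallCost(opts: {
  cachedInputTokens: number;   // tokens served as cache reads
  uncachedInputTokens: number; // per-incident tokens, full rate
  outputTokens: number;
  cacheWriteTokens: number;    // tokens written per write cycle
  hitsPerWrite: number;        // how many calls share one write
}): number {
  const input =
    (opts.cachedInputTokens * PRICE.cacheRead + opts.uncachedInputTokens * PRICE.base) / MTOK;
  const output = (opts.outputTokens * PRICE.output) / MTOK;
  const writeAmortized = (opts.cacheWriteTokens * PRICE.cacheWrite) / MTOK / opts.hitsPerWrite;
  return BATCH * (input + output + writeAmortized);
}

// The call shape from this section: 3000 cached + 1000 uncached input, 500 output.
console.log(
  perCallCost({
    cachedInputTokens: 3000,
    uncachedInputTokens: 1000,
    outputTokens: 500,
    cacheWriteTokens: 1200,
    hitsPerWrite: 30,
  }),
); // ≈ 0.0019 dollars per RCA call
```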
A cluster of typical incidents at this rate is the difference between "a flat-rate pricing model that works" and "a flat-rate pricing model with worst-case unit economics that don't." We document this in our pricing rationale — the discipline isn't a marketing posture, it's the load-bearing constraint that lets the price stay flat.
Where this generalizes
If you're calling Claude on a per-event or per-incident schedule, the structure above applies to whatever shape your calls take. The questions to answer:
- What in your prompt is identical across every call? That's segment 1. If the answer is "nothing," your prompt isn't designed for caching yet — find the constants. There almost always are some.
- What is per-tenant or per-context but reused within a short window? That's segment 2. Common cases: retrieval context, customer-specific style guidelines, account metadata.
- What's truly per-call? Goes in the user message turn, never in the cached system block.
- Is your call rate above the break-even threshold? If you call the same cached prompt fewer than ~1.3 times per 5-minute window, you'll lose money on caching. For a noisy production system this is rarely the bottleneck, but for a low-volume tool it can be.
The pattern doesn't apply only to Claude. OpenAI's prompt caching follows similar economics with different numbers; Gemini's context caching has a different TTL but the same "what's static, what's dynamic" decomposition. The work of setting up your prompts so the static parts cluster at the front pays off across every model that supports caching, which is increasingly all of them.
A single test
If you're considering whether prompt caching applies to your pipeline, the cheapest first measurement is also the most informative one: count how many tokens at the front of your typical request are byte-for-byte identical to the previous request (the cache is prefix-based, so only the shared prefix counts). Not "semantically the same" — literally identical. If the answer is more than 50%, you're leaving money on the table; ship cache_control on the static prefix and watch the input-cost line item drop on the next billing day.
If the answer is less than 20%, your prompts are designed for context, not for repetition, and caching probably won't help much without a structural rewrite. Either way, knowing the number is a one-hour exercise that beats arguing about whether caching is worth the complexity.
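One way to get that number, assuming you can dump two consecutive requests as the exact strings you send (the example strings below are placeholders):

```typescript
// Fraction of the current request that is byte-for-byte identical to the previous
// one, measured as a shared prefix. Characters are a rough proxy for tokens, which
// is good enough to decide whether caching is worth pursuing.
function sharedPrefixRatio(prev: string, curr: string): number {
  let i = 0;
  const limit = Math.min(prev.length, curr.length);
  while (i < limit && prev[i] === curr[i]) i++;
  return i / curr.length;
}

// Two consecutive requests, serialized exactly as they'd be sent (placeholders here).
const prevRequest = 'SYSTEM PROMPT...\n\nPRIOR INCIDENTS...\n\nevent 1 at 14:31:55: ...';
const currRequest = 'SYSTEM PROMPT...\n\nPRIOR INCIDENTS...\n\nevent 1 at 14:32:01: ...';

console.log(`${(sharedPrefixRatio(prevRequest, currRequest) * 100).toFixed(1)}% shared prefix`);
```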
The architecture above is what makes Culprit's flat-rate pricing economically defensible — RCA calls cluster around incidents, the system prompt and retrieval context dominate the input tokens, and the cache hit rate sits comfortably above 90%. Same primitives, different vertical: if you're shipping LLM features into production at any scale where the bill is starting to matter, this is the lowest-effort high-yield refactor you have available.