Originally published at theculprit.ai/blog/anthropic-prompt-caching-90-percent.
LLM costs in production scale faster than the demo bill suggests they will.
The shape of the problem: you ship a feature that calls Claude on every meaningful event. The first month the bill is rounding error and nobody looks at it. The second month a customer's traffic ramps and the line item is suddenly five percent of revenue. The third month your finance person sends a polite Slack about whether this is "a real cost trend or a one-time spike," and everyone on the engineering team has to defend an architecture decision they made eight weeks ago when the bill was rounding error.
You can reduce this. Not by being clever about how you call the model — by being clever about what's constant across your calls. Anthropic's prompt caching, in our case, takes the per-RCA input cost from full-rate to one-tenth of full-rate on a 90%+ cache-hit rate. That's not a hypothetical; it's what we measure in production, and the math is simple enough to walk through here so you can run the numbers on your own pipeline.
The pricing structure
Anthropic publishes four price points per model. For Claude Haiku 4.5, the model we run as the default for incident root-cause analysis, those points are (verified from the Anthropic API docs):
| Token category | Haiku 4.5 |
|---|---|
| Base input | $1.00 per million tokens |
| Cache write (5-minute TTL) | $1.25 per million tokens |
| Cache read | $0.10 per million tokens |
| Output | $5.00 per million tokens |
Two things to read from that table:
- Cache read is 10x cheaper than base input. Same tokens in the request body, ten percent of the cost — if you can get them into the cache.
- Cache write is 25% more expensive than base input. The first time you send a cached segment, you're paying a small premium so the next request can pay the discount. The math only pays off if you use the same cached segment more than ~1.3 times on average within the 5-minute TTL window (the exact break-even works out to about 1.28 uses).
That second point is the one most teams miss. If your call pattern is "one-shot, cold cache every time," prompt caching makes you slightly worse off. The win comes from repeatable structure across calls.
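To make the break-even concrete, here it is as a small sketch. The constants are just the Haiku 4.5 list prices from the table above; nothing here is production code.

```typescript
// Break-even for the 5-minute cache on Haiku 4.5 list prices ($ per million tokens).
const BASE = 1.0;         // uncached input
const CACHE_WRITE = 1.25; // first send of a cached segment
const CACHE_READ = 0.1;   // every reuse within the TTL

// Cost of one cached segment used `uses` times inside a TTL window, vs. not caching it.
const cachedCost = (uses: number) => CACHE_WRITE + CACHE_READ * (uses - 1);
const uncachedCost = (uses: number) => BASE * uses;

// cachedCost(n) < uncachedCost(n) when n > (1.25 - 0.10) / (1.00 - 0.10) ≈ 1.28
console.log(cachedCost(1), uncachedCost(1)); // 1.25 vs 1.00: one-shot calls lose
console.log(cachedCost(2), uncachedCost(2)); // 1.35 vs 2.00: two uses already win
```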
What's actually cacheable in an RCA call
A typical RCA call has five sources of tokens:
- System prompt. Defines the role ("you are an SRE analyzing an incident"), the JSON schema for the response, and any guardrails. Identical across every call across every tenant. Maybe 800-1500 tokens depending on how rigorous your schema is.
- Retrieval context ("here are 3 prior incidents from this same service that resolved similarly"). Static for a few minutes within a Batch run on one tenant + service. Maybe 400-800 tokens depending on how aggressive the retrieval is.
- Per-incident events ("event 1 at 14:32:01: ConnectionPoolExhausted...; event 2 at 14:32:04: ..."). Unique to the incident under analysis. Cannot be cached across incidents. Typically 1500-3000 tokens.
- Per-incident metadata (incident ID, service ID, severity). Tiny but unique.
- Output tokens. The model's response. Cost is fixed at the output rate; caching doesn't apply.
Sources 1 and 2 are cacheable. Sources 3 and 4 are not. Source 5 is irrelevant.
In our distribution, sources 1 + 2 are roughly 70-80% of the input tokens for a typical RCA call. Cache them at $0.10 per million; pay full rate on the remaining 20-30%; total input cost drops by about 60-70% from the naive baseline. The "90%" headline number rounds up because we measure cache hits, not total cost, and within the cached portion the savings really are 90%.
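As a back-of-the-envelope check on that range (a sketch assuming a 75% cacheable share; swap in your own split):

```typescript
// Blended input rate when ~75% of input tokens are served as cache reads.
// Cache writes amortize to noise at a 90%+ hit rate, so they're ignored here.
const cachedShare = 0.75;
const blendedRate = cachedShare * 0.10 + (1 - cachedShare) * 1.00; // $ per million input tokens
console.log(blendedRate); // 0.325, roughly a two-thirds cut from the $1.00 base rate
```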
The two-segment trick
Anthropic's API takes a cache_control marker per segment in your system array. Each marker is a cache breakpoint: the cache stores the prefix of everything up to that marker. With two segments you get two breakpoints, one after the first segment and one after both, so each prefix can hit or miss independently:
```typescript
// Conceptual shape — see rca-prompt.ts for the exact code we run.
const system = [
  {
    type: 'text',
    text: SYSTEM_PROMPT, // ~1200 tokens, identical everywhere
    cache_control: { type: 'ephemeral' },
  },
  {
    type: 'text',
    text: priorIncidentsContext, // ~600 tokens, per-tenant per-service
    cache_control: { type: 'ephemeral' },
  },
];
```
Why two segments instead of one? Because the cache lifetime for those two pieces is different.
The system prompt almost never changes — every RCA call across every tenant hits the cache, so it's a cache read essentially every time after the first call.
The retrieval context (prior similar incidents for this service) changes whenever a new incident on that service resolves and shifts the top-K. Within a single Batch run on one tenant + service, repeats hit the cache. Across tenants, never.
If you stuff both into a single segment, the system prompt can never be shared across tenants: each tenant's combined segment hashes differently, and any change to a tenant's retrieval context forces a fresh cache write of the whole block, system prompt included. Two segments → independent breakpoints → the system-prompt prefix stays hot for everyone, and retrieval-context churn only invalidates the second breakpoint.
The order matters. Anthropic caches up to each marker, so the more-static segment must come first. If you put per-tenant retrieval first and the static system prompt second, the static prompt's cache key now includes the per-tenant content above it; you've just made the most cacheable segment uncacheable across tenants.
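Putting the ordering together, a minimal end-to-end sketch looks like this. The placeholder strings and the model id are illustrative, not lifted from rca-prompt.ts; the call shape is the standard @anthropic-ai/sdk messages API with cache_control on the system segments.

```typescript
import Anthropic from '@anthropic-ai/sdk';

// Placeholder content, standing in for the real prompt assembly.
const SYSTEM_PROMPT =
  'You are an SRE analyzing an incident. Respond with JSON matching the schema...';
const priorIncidentsContext = 'Prior similar incidents for this service: ...';
const incidentEvents = 'event 1 at 14:32:01: ConnectionPoolExhausted; event 2 at 14:32:04: ...';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const response = await anthropic.messages.create({
  model: 'claude-haiku-4-5', // illustrative model id
  max_tokens: 1024,
  system: [
    // Most-static segment first, so its cached prefix survives per-tenant churn below it.
    { type: 'text', text: SYSTEM_PROMPT, cache_control: { type: 'ephemeral' } },
    { type: 'text', text: priorIncidentsContext, cache_control: { type: 'ephemeral' } },
  ],
  messages: [
    // Truly per-call content stays out of the cached prefix entirely.
    { role: 'user', content: incidentEvents },
  ],
});

// usage tells you whether you hit or wrote: cache_read_input_tokens vs cache_creation_input_tokens.
console.log(response.usage);
```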
What kills the cache
In rough order of frequency:
The 5-minute ephemeral TTL. A cached segment expires 5 minutes after it was last used; reads refresh the TTL, so steady traffic keeps the cache warm indefinitely. If your call pattern is bursty (RCA calls cluster around incidents, then quiet for an hour), a long quiet period will let every cached segment expire and you'll pay cache write (slightly above base rate) on the next batch. Spread your calls if you can; if you can't, accept that the first few calls after a quiet period pay full freight.
Whitespace drift. If you concatenate the system prompt with, say, "\n\n" as the separator in one place and "\n" in another, you have two distinct cache keys. The cache hashes the literal token sequence, not the semantic meaning. Pick one separator and lint for it.
Trailing dynamic content. A common bug: someone adds a timestamp to the "system prompt" — Today's date is 2026-05-08T14:32:01Z — for "context". The timestamp changes every call. Now nothing cached after the timestamp survives. Keep dynamic content out of cached segments entirely; pass it as a user-message turn instead.
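A cheap guard against both of those failure modes is to assemble the cached segment through one deterministic path and route anything time-varying into the user turn. A sketch, with made-up helper names:

```typescript
// One separator constant, used everywhere the cached segment is assembled.
const SEP = '\n\n';

// Hypothetical helper: builds the cacheable prefix deterministically.
// No timestamps, request ids, or anything else that changes per call.
function buildCachedSegment(parts: string[]): string {
  return parts.map((p) => p.trim()).join(SEP);
}

// Anything dynamic rides in the user turn, where it can't invalidate the cached prefix.
function buildUserTurn(incidentEvents: string, now: Date = new Date()): string {
  return `Current time: ${now.toISOString()}${SEP}${incidentEvents}`;
}
```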
Schema version churn. If you're iterating on your JSON output schema (a normal early-product activity), every schema edit invalidates every cached system prompt. The cost of "tuning the schema" is partly paid in cache misses. Plan for one or two big schema-stabilization sweeps rather than continuous tweaks.
The production numbers
Per-RCA cost on Haiku 4.5 with prompt caching enabled, Batch API (which itself adds another 50% off both input and output), 4000 input tokens + 500 output tokens, ~75% of input tokens cached:
- Input (cached portion, 3000 tokens × 0.5 batch discount × $0.10/MTok cache read): $0.00015
- Input (uncached portion, 1000 tokens × 0.5 batch discount × $1.00/MTok base): $0.00050
- Output (500 tokens × 0.5 batch discount × $5.00/MTok): $0.00125
- Cache write, amortized (1200 tokens × 0.5 batch discount × $1.25/MTok, divided across ~30 cache hits per write cycle): ~$0.00003
Total: ~$0.0019 per RCA call.
Without caching, same call shape, real-time API: input would be ~$0.004, output would be ~$0.0025, total ~$0.0065. Caching alone gets us a ~65-70% reduction on input. Batch API gets us another 50% off everything on top. Caching + Batch is what makes the per-RCA cost sit around a fifth of a cent.
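If you want to run the same arithmetic on your own call shape, it fits in a few lines. The list prices come from the table above; the 50% batch discount, token counts, and ~30 hits per cache write are the assumptions stated in this section.

```typescript
// Per-call cost model for Haiku 4.5 with prompt caching + Batch API.
// All prices in dollars per million tokens; the batch discount applies to everything.
const PRICE = { base: 1.0, cacheWrite: 1.25, cacheRead: 0.1, output: 5.0 };
const BATCH = 0.5;
const MTOK = 1_000_000;

function perCallCost(opts: {
  cachedInputTokens: number;   // tokens served as cache reads
  uncachedInputTokens: number; // per-incident tokens, full rate
  outputTokens: number;
  cacheWriteTokens: number;    // tokens written per write cycle
  hitsPerWrite: number;        // how many calls share one write
}): number {
  const input =
    (opts.cachedInputTokens * PRICE.cacheRead + opts.uncachedInputTokens * PRICE.base) / MTOK;
  const output = (opts.outputTokens * PRICE.output) / MTOK;
  const writeAmortized = (opts.cacheWriteTokens * PRICE.cacheWrite) / MTOK / opts.hitsPerWrite;
  return BATCH * (input + output + writeAmortized);
}

// The call shape from this section: 3000 cached + 1000 uncached input, 500 output.
console.log(
  perCallCost({
    cachedInputTokens: 3000,
    uncachedInputTokens: 1000,
    outputTokens: 500,
    cacheWriteTokens: 1200,
    hitsPerWrite: 30,
  }),
); // ≈ 0.0019 dollars per RCA call
```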
A cluster of typical incidents at this rate is the difference between "a flat-rate pricing model that works" and "a flat-rate pricing model with worst-case unit economics that don't." We document this in our pricing rationale — the discipline isn't a marketing posture, it's the load-bearing constraint that lets the price stay flat.
Where this generalizes
If you're calling Claude on a per-event or per-incident schedule, the structure above applies to whatever shape your calls take. The questions to answer:
- What in your prompt is identical across every call? That's segment 1. If the answer is "nothing," your prompt isn't designed for caching yet — find the constants. There almost always are some.
- What is per-tenant or per-context but reused within a short window? That's segment 2. Common cases: retrieval context, customer-specific style guidelines, account metadata.
- What's truly per-call? Goes in the user message turn, never in the cached system block.
- Is your call rate above the break-even threshold? If you call the same cached prompt fewer than ~1.3 times per 5-minute window, you'll lose money on caching. For a noisy production system this is rarely the bottleneck, but for a low-volume tool it can be.
The pattern doesn't apply only to Claude. OpenAI's prompt caching follows similar economics with different numbers; Gemini's context caching has a different TTL but the same "what's static, what's dynamic" decomposition. The work of setting up your prompts so the static parts cluster at the front pays off across every model that supports caching, which is increasingly all of them.
A single test
If you're considering whether prompt caching applies to your pipeline, the cheapest first measurement is also the most informative one: count how many tokens at the front of your typical request are byte-for-byte identical to the previous request (the cache is prefix-based, so only the shared prefix counts). Not "semantically the same" — literally identical. If the answer is more than 50%, you're leaving money on the table; ship cache_control on the static prefix and watch the input-cost line item drop on the next billing day.
If the answer is less than 20%, your prompts are designed for context, not for repetition, and caching probably won't help much without a structural rewrite. Either way, knowing the number is a one-hour exercise that beats arguing about whether caching is worth the complexity.
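One way to get that number, assuming you can dump two consecutive requests as the exact strings you send (the example strings below are placeholders):

```typescript
// Fraction of the current request that is byte-for-byte identical to the previous
// one, measured as a shared prefix. Characters are a rough proxy for tokens, which
// is good enough to decide whether caching is worth pursuing.
function sharedPrefixRatio(prev: string, curr: string): number {
  let i = 0;
  const limit = Math.min(prev.length, curr.length);
  while (i < limit && prev[i] === curr[i]) i++;
  return i / curr.length;
}

// Two consecutive requests, serialized exactly as they'd be sent (placeholders here).
const prevRequest = 'SYSTEM PROMPT...\n\nPRIOR INCIDENTS...\n\nevent 1 at 14:31:55: ...';
const currRequest = 'SYSTEM PROMPT...\n\nPRIOR INCIDENTS...\n\nevent 1 at 14:32:01: ...';

console.log(`${(sharedPrefixRatio(prevRequest, currRequest) * 100).toFixed(1)}% shared prefix`);
```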
The architecture above is what makes Culprit's flat-rate pricing economically defensible — RCA calls cluster around incidents, the system prompt and retrieval context dominate the input tokens, and the cache hit rate sits comfortably above 90%. Same primitives, different vertical: if you're shipping LLM features into production at any scale where the bill is starting to matter, this is the lowest-effort high-yield refactor you have available.