LLM Semantic Caching: The 95% Hit Rate Myth (and What Production Data Actually Shows)

Dev.to / 4/5/2026


Key Points

  • The article argues that the widely cited “95% cache hit rate” marketing claim is misleading because published production hit rates for semantic caching are typically in the 20–45% range, with the 95% figure referring to match accuracy rather than how often requests hit the cache.
  • It distinguishes exact caching (hashing the full prompt and generation parameters for an unambiguous match) from semantic caching (matching meaning), explaining that teams should usually implement exact caching first before adding semantic logic.
  • It reports that even a 20% semantic-cache hit rate can generate meaningful savings and performance gains, such as reducing multi-second latency to under ~5ms on cached responses and cutting monthly costs on typical LLM spend.
  • It recommends prioritizing the “marginal improvement” case: add semantic caching only when the additional hit-rate/quality gains justify added complexity and operational overhead.
  • The post includes guidance on which production use cases benefit from semantic caching and which are less likely to see enough cache reuse to be worthwhile.

You opened your OpenAI dashboard this morning and felt that familiar pit in your stomach. The number was higher than last month. Again. Somebody mentioned semantic caching — "just cache the responses, cut costs by 90%." So you looked into it.

The vendor pages all say the same thing: 95% cache hit rates, 90% cost reduction, millisecond responses. Then you ran the numbers on your own traffic and the reality was different. Much different.

This post breaks down how semantic caching actually works, what the published production hit rates are (not the marketing numbers), and which use cases benefit — and which don't.

TL;DR

  1. Published production hit rates range from 20-45%, not 90-95%. The 95% number refers to accuracy of cache matches, not frequency of hits.
  2. Even a 20% hit rate saves real money — $1,000/month on a $5K LLM bill — while cutting latency from 2-5s to under 5ms on cached requests.
  3. Start with exact caching. Add semantic caching only if the marginal improvement justifies the complexity.

Exact Caching vs. Semantic Caching: Two Different Problems

Before diving into architecture, the distinction matters: most teams should start with exact caching and add semantic caching only if exact matches alone don't cover enough of their traffic.

Exact caching

Hash the full prompt (including model name, temperature, and other parameters) with SHA-256. If the hash matches a stored request, return the cached response. Zero ambiguity — the prompt is identical, so the response is valid.

import hashlib
import redis

r = redis.Redis()

# Key on the full prompt plus every generation parameter.
raw = f"{model}|{prompt}|{temperature}|{max_tokens}"
cache_key = hashlib.sha256(raw.encode()).hexdigest()
if (cached := r.get(cache_key)) is not None:
    return cached  # <5ms, zero LLM cost
response = call_llm(prompt)
r.set(cache_key, response, ex=3600)  # expire after one hour
return response

Pros: Zero false positives. Sub-millisecond lookup. Trivial to implement.
Cons: Misses rephrased duplicates. "How do I reset my password?" and "password reset help" are different hashes.

Exact caching alone catches more traffic than you'd expect. The average production app sends 15-30% identical requests — automated pipelines, retries, and users asking the same FAQ.

Semantic caching

Generate a vector embedding of the prompt, compare it via cosine similarity to stored embeddings, and return a cached response if the similarity exceeds a threshold. This catches rephrased duplicates.

# `embed_model` and `vector_db` stand in for an embedding model and a
# vector store with similarity search and TTL support.
embedding = embed_model.encode(prompt)  # ~2-5ms
matches = vector_db.search(embedding, threshold=0.92)
if matches:
    return matches[0].response  # <5ms total, zero LLM cost
response = call_llm(prompt)
vector_db.upsert(embedding, response, ttl=3600)
return response

Pros: Catches semantically similar requests with different wording.
Cons: Embedding generation adds 2-5ms. False positives are possible. Threshold tuning is critical and use-case dependent.

The 95% Myth: What the Numbers Actually Say

The "95% cache hit rate" claim circulates across vendor marketing pages. Here's what the published data actually shows:

| Source | Hit Rate | Context | Type |
| --- | --- | --- | --- |
| Portkey (production) | ~20% | RAG use cases, 99% match accuracy | Vendor data |
| EdTech platform (production) | ~45% | Student Q&A (high repetition) | Case study |
| GPT Semantic Cache (academic) | 61-69% | Controlled benchmark, curated dataset | Research paper |
| General production estimate | 30-40% | Mixed traffic across use cases | Industry average |
| Open-ended chat (production) | 10-20% | Unique conversations, low repetition | Observed range |

The 95% number, when you trace it back, almost always refers to match accuracy — meaning 95% of the time a cache returns a response, that response is correct for the query. Not that 95% of queries hit the cache. These are fundamentally different metrics.

The honest range for production semantic caching: 20-45% hit rate, depending heavily on use case.

Why academic benchmarks are misleading: Academic benchmarks test against curated datasets where similar questions are intentionally grouped. Production traffic is messier — 60-70% of real queries are genuinely unique. The 61-69% hit rates from research papers don't survive contact with production diversity.

Hit Rates by Use Case: Where Caching Works (and Doesn't)

| Use Case | Expected Hit Rate | Why |
| --- | --- | --- |
| FAQ / customer support | 40-60% | Users ask the same questions in slightly different ways. High repetition, bounded answer space. |
| Classification / labeling | 50-70% | Automated pipelines often send identical or near-identical inputs. |
| Internal knowledge base Q&A | 30-45% | Employees ask similar questions about policies, processes, docs. |
| RAG with document retrieval | 15-25% | Context varies per query even if questions are similar. |
| Open-ended chat | 10-20% | Conversations are unique. Multi-turn context makes each request different. |
| Code generation | 5-15% | High specificity per request. Users want varied outputs. |

The pattern: bounded answer spaces with repetitive inputs cache well. Open-ended, context-dependent, or creative tasks don't.

The Threshold Problem: 0.85 vs. 0.92 vs. 0.98

The cosine similarity threshold is the most important — and most under-discussed — configuration in semantic caching. It's the knob that determines whether your cache is useful or dangerous.

  • Threshold 0.85 (aggressive): More cache hits, but higher false positive rate. "How to reset my password" might match "How to change my email" — similar intent, wrong answer. Good for FAQ-style use cases where a slightly imprecise answer is acceptable.
  • Threshold 0.92 (balanced): The sweet spot for most production use cases. Catches clear rephrasings while rejecting distinct-but-similar queries.
  • Threshold 0.98 (conservative): Almost-exact matching. Very few false positives, but you're only catching the most obvious rephrasings. At this point, exact caching captures nearly as much with zero false positive risk.

There is no universal correct threshold. It depends on the cost of a wrong answer in your application. A customer support bot returning a slightly wrong FAQ answer is tolerable. A medical advice application returning a cached answer for a different condition is dangerous.
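To make the threshold mechanics concrete, here is a minimal sketch of the similarity gate using toy three-dimensional vectors (real embeddings have hundreds or thousands of dimensions; the vectors and the `is_cache_hit` helper are illustrative, not a library API):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def is_cache_hit(similarity, threshold=0.92):
    # Gate the cached response behind the configured threshold.
    return similarity >= threshold

# Toy embeddings: a rephrased duplicate points almost the same way,
# a distinct query points elsewhere.
rephrased = cosine_similarity([0.9, 0.4, 0.1], [0.85, 0.45, 0.12])
distinct = cosine_similarity([0.9, 0.4, 0.1], [0.1, 0.2, 0.95])
```

Raising `threshold` toward 0.98 shrinks the hit region toward near-exact matches; lowering it toward 0.85 widens it, and widens the false positive surface with it.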

Five Failure Modes Nobody Warns You About

1. Context-dependent queries that look identical

"What's the status?" asked by User A about Order #4521 and User B about Order #7893 will have near-identical embeddings. Without user-scoped or session-scoped cache keys, User B gets User A's order status. Cache keys must include relevant context — not just the prompt text.

2. Time-sensitive queries returning stale answers

"What's the latest pricing for GPT-5?" cached last week is wrong this week if pricing changed. TTL helps, but the right TTL varies by query type. Pricing questions need TTLs of hours. FAQ answers can cache for days. One-size-fits-all TTL is a guarantee of either stale answers or low hit rates.

3. Embedding model drift

If you update your embedding model, all previously cached embeddings become invalid. The similarity scores between old and new embeddings are meaningless. You need a cache invalidation strategy tied to your embedding model version. Most teams learn this the hard way after a model update causes a spike in incorrect cache responses.
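One simple invalidation strategy, sketched under the assumption that your vector store supports namespaces (or key prefixes): bake the embedding model version into the namespace, so a version bump orphans the old entries instead of letting meaningless cross-model similarity scores leak through:

```python
def versioned_namespace(base, embed_model_version):
    # Store cached embeddings under a namespace keyed to the embedding
    # model version. Changing the version starts a fresh cache; old
    # entries age out via TTL rather than being compared against.
    return f"{base}:embv={embed_model_version}"

# Hypothetical version tags for illustration.
old_ns = versioned_namespace("semcache", "text-embedding-3-small@1")
new_ns = versioned_namespace("semcache", "text-embedding-3-small@2")
```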

4. Cache poisoning from bad responses

If the LLM returns a hallucinated or incorrect response and you cache it, every similar future query gets that same bad answer. The cache amplifies the error. Mitigation: add quality checks before caching (confidence scores, length validation, format checks), or let users flag cached responses as incorrect to trigger cache eviction.
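The pre-cache quality gate can start as a few cheap heuristics. This is a minimal sketch; the length bounds and hedging markers are illustrative placeholders, not a standard:

```python
def should_cache(response,
                 min_length=20,
                 max_length=8000,
                 banned_markers=("I'm not sure", "As an AI")):
    # Reject responses that are suspiciously short or long, or that
    # contain hedging phrases you never want amplified by the cache.
    if not (min_length <= len(response) <= max_length):
        return False
    if any(marker in response for marker in banned_markers):
        return False
    return True
```

In production you might add format checks (valid JSON, expected fields) or a user-facing "flag this answer" path that evicts the cached entry.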

5. Streaming response caching complexity

Most LLM calls use streaming (stream: true). You can't cache a streaming response mid-stream — you need to buffer the full response, then store it. On cache hit, you either return the full response instantly (breaking the streaming contract your client expects) or simulate streaming by chunking the cached response with artificial delays. Both are engineering overhead that vendors rarely mention.
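The "simulate streaming" option can be sketched as a generator that re-chunks the buffered response; the chunk size and optional pacing delay are arbitrary knobs, not a protocol requirement:

```python
import time

def replay_as_stream(cached_response, chunk_size=24, delay=0.0):
    # Re-chunk a fully cached response so clients that expect a
    # streaming contract still receive incremental pieces. The
    # optional delay mimics generation pacing (0 disables it).
    for i in range(0, len(cached_response), chunk_size):
        if delay:
            time.sleep(delay)
        yield cached_response[i:i + chunk_size]

chunks = list(replay_as_stream(
    "A cached answer that the client expects as a stream.",
    chunk_size=10,
))
```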

The Dollar Math: What Caching Actually Saves

For a team spending $5,000/month on LLM APIs:

| Hit Rate | Monthly Savings |
| --- | --- |
| 10% | $500/month |
| 20% | $1,000/month |
| 30% | $1,500/month |
| 45% | $2,250/month |

The savings come from two places: avoided LLM calls (the obvious one) and reduced latency (the hidden one). A cache hit returns in under 5ms instead of 2-5 seconds. For customer-facing applications, that latency improvement often matters more than the dollar savings.

The cost of running the cache itself is minimal. Embedding generation uses a small model (text-embedding-3-small at $0.02/1M tokens). Vector storage in Redis or a dedicated vector DB adds $50-200/month depending on cache size. The infrastructure cost is under 5% of the savings at even a 10% hit rate.
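The arithmetic behind the table is just hit rate times spend, minus infrastructure cost. A one-liner makes the net calculation explicit:

```python
def projected_savings(monthly_spend, hit_rate, infra_cost=0.0):
    # Net savings = avoided LLM spend minus cache infrastructure cost.
    # hit_rate is a fraction (0.20 for 20%).
    return monthly_spend * hit_rate - infra_cost
```

For example, a 10% hit rate on $5,000/month with $200/month of cache infrastructure still nets $300/month before counting the latency benefit.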

The Right Architecture: Layer Exact and Semantic Caching

The best approach is a two-layer cache that checks exact matches first (fast, zero risk) and falls back to semantic matching only when needed:

# Layer 1: exact cache (sub-ms lookup, zero false positives)
exact_key = sha256(model + prompt + params)
if exact_hit := cache.get(exact_key):
    return exact_hit

# Layer 2: semantic cache (2-5ms embedding cost, threshold-gated)
embedding = embed(prompt)
if semantic_hit := vector_db.search(embedding, threshold=0.92):
    return semantic_hit.response

# Miss on both layers: call the LLM
response = call_llm(prompt)

# Write to both layers so exact and rephrased repeats both hit
cache.set(exact_key, response, ttl=3600)
vector_db.upsert(embedding, response, ttl=3600)

return response

The average app we onboard discovers that 18% of requests are exact duplicates on day one — before semantic matching even kicks in.

Cache backends matter less than you'd think. In-memory works for single-instance proxies. Redis works for distributed deployments. Dedicated vector databases (Qdrant, Pinecone) are worth it only if your cache exceeds 1M entries — below that, Redis with vector search is sufficient and simpler to operate.

Start With Measurement, Not Implementation

The most common mistake: building a caching layer before understanding what your traffic looks like. You might spend two weeks implementing semantic caching only to discover that your traffic is 90% unique, context-dependent queries with a 12% hit rate ceiling.

Measure first:

  1. Log all prompts for a week. Hash them. Count exact duplicates. That's your floor.
  2. Sample 1,000 requests. Generate embeddings. Cluster them. Count how many fall within a 0.92 similarity threshold. That's your ceiling.
  3. Estimate savings. Floor hit rate × monthly LLM spend = guaranteed savings. Ceiling hit rate × monthly spend = maximum possible savings. If both numbers are under $200/month, caching isn't worth the engineering effort.
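Step 1, the exact-duplicate floor, can be computed from a prompt log in a few lines. This is a sketch assuming you have the raw prompt strings available; the sample log is made up:

```python
import hashlib
from collections import Counter

def exact_duplicate_rate(prompts):
    # Hash each prompt and count how many requests repeat an earlier
    # hash. That fraction is the floor an exact cache would capture.
    counts = Counter(hashlib.sha256(p.encode()).hexdigest() for p in prompts)
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(prompts) if prompts else 0.0

log = ["reset password", "reset password", "pricing?",
       "reset password", "unique query"]
```

In this toy log, two of five requests repeat an earlier prompt, so the floor is 40%. The step-2 ceiling requires embeddings and clustering, but the same counting logic applies once you group by similarity instead of by hash.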

If both numbers justify the effort, start with exact caching only. Run it for two weeks. Then add semantic caching on top and compare the marginal improvement. If semantic caching only adds 5-8 percentage points over exact caching, the false positive risk may not justify the complexity.

We're building Preto.ai — LLM cost optimization that detects exact duplicates and semantically similar requests across your traffic. See your cache potential and projected savings before you build anything. Free for up to 10K requests.