Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit
arXiv cs.AI / 4/20/2026
Key Points
- The paper argues that the Shannon entropy limit for per-vector KV-cache compression (as approached by methods like TurboQuant) is not the relevant bottleneck because KV caches are structured as sequences sampled from the model’s training language.
- It introduces “sequential KV compression,” a two-layer method whose first layer deduplicates semantic prefixes shared across sessions using Probabilistic Language Tries (PLTs).
- The second layer, predictive delta coding, stores only the residual of each KV vector relative to the model’s own prediction, yielding an entropy bound tied to the next-token conditional entropy rather than raw KV values.
- The authors report an average bound of about 3.3–4.3 bits per token position at typical perplexities for fluent English, estimate very large theoretical compression gains over TurboQuant (up to ~914,000x at the Shannon limit), and claim the gains persist even under pessimistic overhead assumptions.
- The method is designed to be composable with existing per-vector quantization approaches (including TurboQuant), suggesting it can improve compression without replacing current quantization pipelines.
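The prefix-deduplication layer described above can be sketched with a plain token trie. This is a minimal illustration, not the paper's method: the paper's PLTs additionally attach language-model probabilities, which this toy structure omits, and all class and variable names here are illustrative.

```python
# Toy prefix trie: sessions that share a token prefix share the
# corresponding KV-cache entries instead of storing duplicates.
class TrieNode:
    def __init__(self):
        self.children = {}  # token id -> TrieNode

class PrefixTrie:
    def __init__(self):
        self.root = TrieNode()
        self.stored = 0  # KV entries actually materialized

    def insert(self, tokens):
        """Insert a session's token sequence; count only the KV entries
        that were not already covered by a previously stored prefix."""
        node = self.root
        new_entries = 0
        for tok in tokens:
            if tok not in node.children:
                node.children[tok] = TrieNode()
                new_entries += 1
            node = node.children[tok]
        self.stored += new_entries
        return new_entries

trie = PrefixTrie()
a = trie.insert([1, 2, 3, 4])  # fresh session: 4 new KV entries
b = trie.insert([1, 2, 3, 5])  # shares prefix [1, 2, 3]: only 1 new entry
```

The second insert pays for only one position instead of four, which is the mechanism behind the cross-session savings the summary describes.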
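The predictive delta-coding layer can likewise be sketched: store only the residual of each KV vector against a predictor's guess, and invert the process on decode. The `last_value` predictor below is a deliberately trivial stand-in for the model's own prediction, and all function names are illustrative assumptions, not the paper's API.

```python
def encode_delta(kv_vectors, predict):
    """Encode each vector as its residual against the predictor's guess."""
    residuals, history = [], []
    for v in kv_vectors:
        pred = predict(history)
        residuals.append([a - b for a, b in zip(v, pred)])
        history.append(v)
    return residuals

def decode_delta(residuals, predict):
    """Invert encoding: rebuild each vector as residual + prediction."""
    history = []
    for r in residuals:
        pred = predict(history)
        history.append([a + b for a, b in zip(r, pred)])
    return history

# Trivial predictor: repeat the previous vector (zeros at the start).
def last_value(history):
    return history[-1] if history else [0.0, 0.0]

kvs = [[1.0, 2.0], [1.5, 2.5], [1.6, 2.4]]
enc = encode_delta(kvs, last_value)
assert decode_delta(enc, last_value) == kvs  # round-trip is exact
```

A better predictor leaves smaller residuals, so the bits needed per position track the predictor's uncertainty rather than the raw vector values, mirroring the summary's claim that the bound follows the next-token conditional entropy.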