The Residual Stream Is All You Need: On the Redundancy of the KV Cache in Transformer Inference
arXiv cs.LG / 3/23/2026
Key Points
- The KV cache in transformer inference is redundant because keys and values at every layer are deterministic projections of the residual stream, enabling bit-identical reconstruction from a single residual vector per token.
- Across six models from four architecture families, cross-task residual patching yields D_KL = 0 between patched and original output distributions, confirming that the residual stream satisfies a Markov property and is the sole information-carrying state.
- Removing the cache entirely and recomputing from scratch yields token-identical output under greedy decoding on all models tested.
- KV-Direct is a bounded-memory inference scheme that checkpoints residual vectors (about 5 KB per token on Gemma 3-4B) instead of full KV pairs (about 136 KB), shrinking per-token cached state by roughly 27x.
- In experiments over 20 conversation turns, KV-Direct holds peak memory at 42 MB while the standard cache grows past 103 MB; it maintains 100% token match against five eviction baselines, and recomputation can be faster than reading cached tensors at moderate batch sizes; code is available at the provided GitHub link.
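The core claim above — that keys and values are deterministic projections of the residual stream, so caching only the residual suffices — can be sketched in a few lines. This is a minimal toy, not the paper's implementation: the weight matrices, dimensions, and the `kv_from_residual` helper are illustrative assumptions, and real models would also apply layer norm and positional encodings, which are likewise deterministic functions of the residual and the token position.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 8, 4

# Hypothetical per-layer projection weights; fixed at inference time.
W_K = rng.standard_normal((d_model, d_head))
W_V = rng.standard_normal((d_model, d_head))

def kv_from_residual(x):
    """Keys/values are pure functions of the residual vector x."""
    return x @ W_K, x @ W_V

x = rng.standard_normal(d_model)          # residual vector for one token
k_cached, v_cached = kv_from_residual(x)  # what a standard KV cache stores
k_recomp, v_recomp = kv_from_residual(x)  # KV-Direct: recompute from checkpointed x

# Reconstruction is bit-identical because the projections are deterministic.
assert np.array_equal(k_cached, k_recomp)
assert np.array_equal(v_cached, v_recomp)

# Back-of-envelope per-token memory, using the figures quoted above:
# ~5 KB residual checkpoint vs ~136 KB of KV pairs, i.e. ~27x smaller.
print(f"compression: {136 / 5:.1f}x")
```

The trade being made is memory for compute: KV-Direct stores one vector per token per checkpoint and pays the projection cost again on reuse, which the summary notes can still beat reading cached tensors from memory at moderate batch sizes.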
Related Articles
How to Enforce LLM Spend Limits Per Team Without Slowing Down Your Engineers
Dev.to
v1.82.6.rc.1
LiteLLM Releases
How political censorship actually works inside Qwen, DeepSeek, GLM, and Yi: Ablation and behavioral results across 9 models
Reddit r/LocalLLaMA
Reduce token errors and costs in agents with semantic tool selection
Dev.to
How I Built Enterprise Monitoring Software in 6 Weeks Using Structured AI Development
Dev.to