The Residual Stream Is All You Need: On the Redundancy of the KV Cache in Transformer Inference

arXiv cs.LG / 3/23/2026


Key Points

  • The KV cache in transformer inference is redundant because keys and values at every layer are deterministic projections of the residual stream, enabling bit-identical reconstruction from a single residual vector per token.
  • Across six models from four architecture families, cross-task residual patching yields D_KL = 0 between patched and original output distributions, confirming that the residual stream satisfies a Markov property and is the sole information-carrying state.
  • Removing the cache entirely and recomputing from scratch yields token-identical output under greedy decoding on all models tested.
  • KV-Direct is a bounded-memory inference scheme that checkpoints residual vectors (about 5 KB per token on Gemma 3-4B) instead of full KV pairs (about 136 KB) and recomputes keys and values on demand.
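  • Over 20 conversation turns, KV-Direct holds peak memory at 42 MB while the standard cache grows past 103 MB.
  • Against five eviction baselines (H2O, StreamingLLM, SnapKV, TOVA, window-only), KV-Direct maintains 100% token match at every cache budget, while all baselines degrade to 5-28%.
  • A per-operation latency analysis shows recomputation running up to 5x faster than reading cached tensors at moderate batch sizes.
  • Code is available at https://github.com/Kaleemullahqasim/KV-Direct.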
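The redundancy claim in the first key point can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the paper's implementation: it models a single layer with a pre-attention RMS norm and random weights, and the names `rms_norm`, `project_kv`, and all dimensions are hypothetical. Both paths use the same function on the same shapes, so the comparison is a bitwise one.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Normalize each token vector by its root-mean-square (no learned scale in this toy).
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def project_kv(residual, w_k, w_v):
    # Keys and values are deterministic functions of the residual stream:
    # normalize, then apply the fixed key and value projection matrices.
    h = rms_norm(residual)
    return h @ w_k, h @ w_v

rng = np.random.default_rng(0)
n_tokens, d_model, d_kv = 8, 64, 16  # illustrative sizes, not from the paper

residual = rng.standard_normal((n_tokens, d_model)).astype(np.float32)
w_k = rng.standard_normal((d_model, d_kv)).astype(np.float32)
w_v = rng.standard_normal((d_model, d_kv)).astype(np.float32)

# Standard path: compute K and V once and keep them as a cache.
cached_k, cached_v = project_kv(residual, w_k, w_v)

# Checkpoint path: keep only the residual vectors, recompute K and V on demand.
checkpoint = residual.copy()
recomputed_k, recomputed_v = project_kv(checkpoint, w_k, w_v)

# Bitwise comparison, not a tolerance-based one.
keys_match = np.array_equal(cached_k, recomputed_k)
values_match = np.array_equal(cached_v, recomputed_v)
print(keys_match, values_match)
```

In a real multi-layer model the residual entering each layer depends on the layers below it, so recomputation means rerunning part of the forward pass. The toy only shows why nothing is lost by discarding the projected tensors.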

Abstract

The key-value (KV) cache is widely treated as essential state in transformer inference, and a large body of work engineers policies to compress, evict, or approximate its entries. We prove that this state is entirely redundant: keys and values at every layer are deterministic projections of the residual stream, and recomputing them from a single residual vector per token incurs exactly zero reconstruction error; the result is bit-identical, not merely approximate. We verify this across six models from four architecture families (135M to 4B parameters). Cross-task residual patching at every layer produces D_KL = 0 between patched and original output distributions, confirming that the residual stream satisfies a Markov property and is the sole information-carrying state. Removing the cache entirely and recomputing from scratch yields token-identical output under greedy decoding on all models tested. We build on this result with KV-Direct, a bounded-memory inference scheme that checkpoints residual vectors (5 KB per token on Gemma 3-4B) instead of full KV pairs (136 KB), recomputing keys and values on demand. Over 20 conversation turns, KV-Direct holds peak memory at 42 MB while the standard cache grows past 103 MB. Against five eviction baselines (H2O, StreamingLLM, SnapKV, TOVA, window-only), KV-Direct maintains 100% token match at every cache budget; all baselines degrade to 5-28%. A per-operation latency analysis shows recomputation runs up to 5x faster than reading cached tensors at moderate batch sizes. Code is available at https://github.com/Kaleemullahqasim/KV-Direct.
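The per-token figures quoted above (5 KB versus 136 KB on Gemma 3-4B) can be reproduced with back-of-envelope arithmetic. The configuration values below are assumptions not stated in this summary: a hidden size of 2560, 34 layers, 4 KV heads, a head dimension of 256, and 2-byte (bf16) storage. They are chosen because they match the quoted sizes.

```python
# Assumed Gemma 3-4B configuration (not stated in the source text).
BYTES_PER_VALUE = 2   # bf16 storage
D_MODEL = 2560        # residual stream width
N_LAYERS = 34
N_KV_HEADS = 4
HEAD_DIM = 256

# One residual vector per token.
residual_bytes = D_MODEL * BYTES_PER_VALUE

# One key and one value per KV head, per layer, per token.
kv_bytes = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE

ratio = kv_bytes / residual_bytes

print(f"residual checkpoint: {residual_bytes / 1024:.0f} KB per token")
print(f"full KV pairs:       {kv_bytes / 1024:.0f} KB per token")
print(f"ratio:               {ratio:.1f}x")
```

Under these assumptions the checkpoint is roughly 27 times smaller per token than the full set of KV pairs, which fits the gap in peak memory that the abstract reports.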