The Residual Stream Is All You Need: On the Redundancy of the KV Cache in Transformer Inference
arXiv cs.LG / 3/23/2026
Key Points
- The KV cache in transformer inference is redundant because keys and values at every layer are deterministic projections of the residual stream, enabling bit-identical reconstruction from a single residual vector per token.
- Across six models from four architecture families, cross-task residual patching yields D_KL = 0 between patched and original output distributions, confirming that the residual stream satisfies a Markov property and is the sole information-carrying state.
- Removing the cache entirely and recomputing from scratch yields token-identical output under greedy decoding on all models tested.
- KV-Direct is a bounded-memory inference scheme that checkpoints residual vectors (about 5 KB per token on Gemma 3-4B) instead of full KV pairs (about 136 KB), shrinking per-token cached state by roughly 27x.
- In experiments over 20 conversation turns, KV-Direct holds peak memory at 42 MB while the standard cache grows past 103 MB; it maintains 100% token match against five eviction baselines, and recomputation can be faster than reading cached tensors at moderate batch sizes; code is available at the provided GitHub link.
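The core claim above — that keys and values are deterministic projections of the residual stream, so caching only the residual suffices — can be sketched in a few lines. This is a minimal toy, not the paper's implementation: the weight matrices, dimensions, and the `kv_from_residual` helper are illustrative assumptions, and real models would also apply layer norm and positional encodings, which are likewise deterministic functions of the residual and the token position.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 8, 4

# Hypothetical per-layer projection weights; fixed at inference time.
W_K = rng.standard_normal((d_model, d_head))
W_V = rng.standard_normal((d_model, d_head))

def kv_from_residual(x):
    """Keys/values are pure functions of the residual vector x."""
    return x @ W_K, x @ W_V

x = rng.standard_normal(d_model)          # residual vector for one token
k_cached, v_cached = kv_from_residual(x)  # what a standard KV cache stores
k_recomp, v_recomp = kv_from_residual(x)  # KV-Direct: recompute from checkpointed x

# Reconstruction is bit-identical because the projections are deterministic.
assert np.array_equal(k_cached, k_recomp)
assert np.array_equal(v_cached, v_recomp)

# Back-of-envelope per-token memory, using the figures quoted above:
# ~5 KB residual checkpoint vs ~136 KB of KV pairs, i.e. ~27x smaller.
print(f"compression: {136 / 5:.1f}x")
```

The trade being made is memory for compute: KV-Direct stores one vector per token per checkpoint and pays the projection cost again on reuse, which the summary notes can still beat reading cached tensors from memory at moderate batch sizes.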
Related Articles
How to Enforce LLM Spend Limits Per Team Without Slowing Down Your Engineers
Dev.to
v1.82.6.rc.1
LiteLLM Releases
How political censorship actually works inside Qwen, DeepSeek, GLM, and Yi: Ablation and behavioral results across 9 models
Reddit r/LocalLLaMA
Reduce token errors and costs in agents with semantic tool selection
Dev.to
How I Built Enterprise Monitoring Software in 6 Weeks Using Structured AI Development
Dev.to