The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference
arXiv cs.AI / 4/20/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper shows that KV caching in autoregressive Transformer inference is not numerically equivalent to cache-free computation when using standard FP16, due to differences in floating-point accumulation order.
- Experiments on three open-weight models (LLaMA-2-7B, Mistral-7B-v0.3, Gemma-2-2B) find a deterministic 100% token divergence rate across decoding strategies, including greedy decoding, indicating the issue is not caused by sampling randomness.
- FP32 control runs reduce numerical divergence by eight orders of magnitude and eliminate token flips entirely (a 0.0% flip rate), confirming FP16 non-associativity as the sole causal factor.
- Layer-wise analysis and activation patching localize how divergence propagates and identify the stateful KV cache as the key variable, with different attention architectures producing distinct drift patterns.
- The authors conclude that FP16 KV-cache inference is fundamentally non-equivalent to recomputation and provide a mechanistic framework for understanding numerical instability in modern LLM inference.
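The floating-point non-associativity underlying these findings is easy to reproduce directly. The sketch below is illustrative only (it is not the paper's code): in FP16, changing the grouping of an addition changes the rounded result, which is exactly why a KV-cached accumulation order and a cache-free recomputation order can disagree.

```python
import numpy as np

# FP16 has an 11-bit significand, so the spacing (ulp) between representable
# values at 2048 is 2. Adding 1 to 2048 rounds back to 2048 (round-to-even),
# while adding a pre-summed 2 lands exactly on 2050.
a, b, c = np.float16(2048), np.float16(1), np.float16(1)

left = (a + b) + c   # 2048 + 1 -> 2048 (tie rounds to even), then + 1 -> 2048
right = a + (b + c)  # 1 + 1 = 2 exactly, then 2048 + 2 -> 2050

print(left, right)   # -> 2048.0 2050.0: (a+b)+c != a+(b+c) in FP16
```

Scaled up to the thousands of multiply-accumulate terms in an attention dot product, such rounding differences between the cached and recomputed accumulation orders can be enough to flip an argmax, which is consistent with the deterministic divergence the paper reports under greedy decoding.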