The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference

arXiv cs.AI / 4/20/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper shows that KV caching in autoregressive Transformer inference is not numerically equivalent to cache-free computation when using standard FP16, due to differences in floating-point accumulation order.
  • Experiments on three open-weight models (LLaMA-2-7B, Mistral-7B-v0.3, Gemma-2-2B) find a deterministic 100% token divergence rate across decoding strategies, including greedy decoding, indicating the issue is not caused by sampling randomness.
  • An FP32 control run reduces divergence by eight orders of magnitude and eliminates token flips entirely (a 0.0% flip rate), confirming FP16 non-associativity as the sole causal factor.
  • Layer-wise analysis and activation patching localize how divergence propagates and identify the stateful KV cache as the key variable, with different attention architectures producing distinct drift patterns.
  • The authors conclude that FP16 KV-cache inference is fundamentally non-equivalent to recomputation and provide a mechanistic framework for understanding numerical instability in modern LLM inference.
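The core mechanism behind these findings, FP16 non-associativity, can be demonstrated in a few lines. The sketch below is our own minimal illustration (not the paper's code): three values whose sum depends on the grouping of additions, because FP16's 11-bit significand rounds away small addends near 2048.

```python
import numpy as np

# FP16 addition is non-associative: the same three values summed in two
# orders give different results. The values are chosen so the rounding is
# easy to check by hand; this is our illustration, not the paper's code.
a = np.float16(2048.0)  # ulp(2048) in FP16 is 2, so adding 1.0 can be lost
b = np.float16(1.0)
c = np.float16(1.0)

left = (a + b) + c   # 2048 + 1 rounds back to 2048 (ties-to-even), twice
right = a + (b + c)  # 1 + 1 = 2 is exact, and 2048 + 2 = 2050 is representable

print(left, right)   # 2048.0 2050.0
```

A KV-cached decode and a cache-free recompute sum the same terms in different orders at every attention layer, so, as the paper reports, such rounding differences accumulate until decoded tokens flip.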

Abstract

KV caching is a ubiquitous optimization in autoregressive transformer inference, long presumed to be numerically equivalent to cache-free computation. This assumption fails under standard FP16 precision: cache-ON and cache-OFF execution paths employ different floating-point accumulation orderings which, due to FP16 non-associativity, produce a deterministic divergence in decoded token sequences. Across three open-weight models (LLaMA-2-7B, Mistral-7B-v0.3, Gemma-2-2B) evaluated on GSM8K, we observe a 100% token divergence rate across all sampling strategies, including greedy decoding, which rules out sampling randomness as a cause. Moreover, cache-ON yields higher accuracy in 8 of 9 conditions, indicating that the direction of divergence is systematic rather than random. Controlled FP32 falsification reduces divergence by eight orders of magnitude and eliminates token flips entirely (a 0.0% flip rate), confirming FP16 non-associativity as the sole causal driver. Layer-wise drift profiling reveals architecturally predictable propagation patterns: models using Grouped-Query Attention exhibit sharp divergence at the first layer, while Gemma's larger head dimension and sliding-window attention produce uniform accumulation across all layers. Finally, activation patching of the entire residual stream fails to recover the cache-free trajectory, localizing the causal variable to the stateful KV cache. These findings establish that FP16 KV-cache inference is fundamentally non-equivalent to recomputation and provide a mechanistic framework for understanding numerical instability in modern LLM inference systems.
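The accumulation-ordering difference the abstract describes can be mimicked with a toy reduction. In this sketch (our own simplification, with invented names; the paper works with real attention kernels), a left-to-right FP16 sum stands in for incremental, cache-ON-style accumulation, while a chunked FP16 reduction stands in for a fused, cache-OFF-style recompute:

```python
import numpy as np

# Toy model of the two execution paths. Cache-OFF recomputes attention over
# the whole prefix in one fused reduction; cache-ON appends to cached state,
# reducing the same terms in a different order. We mimic that ordering
# difference with two FP16 summation strategies over identical inputs.
rng = np.random.default_rng(0)
scores = rng.standard_normal(512).astype(np.float16)

def seq_sum(a):
    """Left-to-right FP16 accumulation (incremental, cache-ON-like)."""
    acc = np.float16(0.0)
    for v in a:
        acc = np.float16(acc + v)
    return acc

def chunked_sum(a, chunk=64):
    """Chunked FP16 accumulation (fused-kernel, cache-OFF-like)."""
    partials = [seq_sum(a[i:i + chunk]) for i in range(0, len(a), chunk)]
    return seq_sum(np.array(partials, dtype=np.float16))

# The two orderings typically disagree in the low-order bits; widening the
# inputs to FP32 before summing shrinks the gap, mirroring the FP32 control.
print(seq_sum(scores), chunked_sum(scores))
```

Inside a transformer, such per-reduction discrepancies feed the softmax and residual stream of every subsequent layer, which is why the paper observes drift compounding across layers rather than staying bounded.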