When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding?

arXiv cs.CL · April 30, 2026


Key Points

  • The paper investigates why speculative decoding with hidden-state-based draft models suffers from long-range decay, where draft accuracy drops as the speculative step increases.
  • It argues that hidden-state reuse works like biased context compression tied to the current attention query, potentially discarding information needed for later speculative steps.
  • The authors propose the KV-Reuse Hypothesis: letting the draft model reuse the target model’s KV cache can retain explicit, token-wise context and improve long-horizon drafting.
  • They introduce KVShot, a diagnostic framework comparing hidden-only, KV-only, and hybrid reuse, and report that KV-Reuse improves long-range acceptance on Qwen3-8B, though end-to-end speedups are still limited.
  • The analysis identifies two structural bottlenecks: draft models too shallow to estimate the target's attention queries accurately, and sparse gradient signals reaching draft-side KV projections. This implies that KV-aware decoding may require block-wise training rather than test-time training (TTT) alone.

Abstract

Speculative decoding accelerates LLM inference, but state-of-the-art hidden-state-based drafters suffer from long-range decay: draft accuracy degrades as the speculative step increases. Existing work attributes this decay to train-inference mismatch and proposes test-time training (TTT) as a remedy, yet we observe that long-range decay persists even in TTT-trained drafters. We revisit long-range decay from the perspective of context information preservation. In hidden-state reuse, we argue the target hidden state acts as a biased context compression: it aggregates historical token information according to the attention query at the current position, yielding a compact representation optimized for immediate next-token prediction. This compression can suppress information less relevant to the current query but important for later speculative steps. In contrast, the target model's KV cache serves as an explicit context, retaining the complete set of token-wise KV representations. We therefore posit the KV-Reuse Hypothesis: allowing the draft model to reuse the target KV cache can provide richer signals for long-horizon drafting. To test this hypothesis, we introduce KVShot, a diagnostic framework that compares three reuse paradigms: hidden-only, KV-only, and hybrid. Extensive evaluations on Qwen3-8B show that KV-Reuse improves long-range acceptance, although end-to-end speedups remain marginal under current training pipelines. Our analysis identifies two key structural bottlenecks: shallow drafters struggle to estimate target queries accurately, and draft-side KV projections receive sparse gradient signals. These findings suggest that realizing the full potential of KV-aware decoding requires moving beyond TTT toward block-wise training paradigms. By exposing these bottlenecks, KVShot provides a foundational diagnostic testbed and a clear roadmap for designing next-generation inference architectures.
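To make the compression argument concrete, here is a minimal sketch (mine, not the paper's code) of why a single hidden state is a query-biased summary while the KV cache remains an explicit per-token context: a later step with a different query can re-attend over the cached keys and values and obtain a different readout, which a single already-compressed vector cannot provide.

```python
# Schematic contrast (hypothetical toy, not the paper's implementation):
# one attention readout collapses the context into a vector weighted by the
# *current* query; the KV cache keeps every token's key/value pair so that
# later queries can re-weight the same explicit context differently.
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [x / s for x in e]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Single-head dot-product attention over a cached (keys, values) list."""
    w = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    return [sum(wi * v[d] for wi, v in zip(w, values)) for d in range(dim)]

# Hypothetical 2-d cache of three past tokens.
keys   = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

h_now = attend([2.0, 0.0], keys, values)   # hidden state for the current query
# Hidden-only reuse: later speculative steps must work from h_now alone.
# KV-reuse: a later step with a different query re-attends to the cache.
h_later = attend([0.0, 2.0], keys, values)
print(h_now, h_later)   # distinct readouts from the same explicit context
```

The hidden-only drafter sees only `h_now`; information about individual past tokens that `h_now` down-weighted is irrecoverable, which is precisely the bias the KV-Reuse Hypothesis aims to avoid.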