When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding?
arXiv cs.CL / 4/30/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper investigates why speculative decoding with hidden-state-based draft models suffers from long-range decay, where draft accuracy drops as the speculative step increases.
- It argues that hidden-state reuse works like biased context compression tied to the current attention query, potentially discarding information needed for later speculative steps.
- The authors propose the KV-Reuse Hypothesis: letting the draft model reuse the target model’s KV cache can retain explicit, token-wise context and improve long-horizon drafting.
- They introduce KVShot, a diagnostic framework comparing hidden-only, KV-only, and hybrid reuse, and report that KV-Reuse improves long-range acceptance on Qwen3-8B, though end-to-end speedups are still limited.
- The analysis identifies structural bottlenecks: draft models that are too shallow to estimate the target's attention queries accurately, and sparse gradient signals for the draft-side KV projections. This suggests KV-aware decoding may require block-wise training rather than TTT alone.
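The long-range decay in the first key point can be illustrated with a toy greedy-verification loop. Everything here is a hypothetical stand-in, not the paper's setup: the "models" are hash functions over the context, and the per-step error rate is a contrived proxy for hidden-state drift growing with speculative depth.

```python
import random

random.seed(0)

def target_next(ctx):
    # Toy "target model": a deterministic function of the full context.
    return sum(ctx) * 7 % 16

def draft_next(ctx, step):
    # Toy "draft model": agrees with the target at step 0 but degrades
    # as the speculative step grows, mimicking long-range decay.
    guess = target_next(ctx)
    if random.random() < 0.15 * step:  # hypothetical drift rate
        guess = (guess + 1) % 16
    return guess

def speculate(ctx, k):
    """Draft k tokens, then greedily verify them against the target.
    Returns the number of accepted draft tokens (0..k)."""
    draft_ctx = list(ctx)
    drafted = []
    for step in range(k):
        tok = draft_next(draft_ctx, step)
        drafted.append(tok)
        draft_ctx.append(tok)
    # Accept the longest prefix on which the target agrees.
    accepted, verify_ctx = 0, list(ctx)
    for tok in drafted:
        if target_next(verify_ctx) != tok:
            break
        accepted += 1
        verify_ctx.append(tok)
    return accepted

# Per-position acceptance rate over many random contexts.
trials, k = 2000, 4
reach = [0] * k
for _ in range(trials):
    ctx = [random.randrange(16) for _ in range(8)]
    for i in range(speculate(ctx, k)):
        reach[i] += 1
rates = [r / trials for r in reach]
print(rates)  # acceptance rate falls off with speculative step
```

Running this shows a monotonically decreasing acceptance curve: each extra speculative step compounds the draft's drift, which is exactly the decay the KV-Reuse Hypothesis aims to mitigate by giving the draft explicit, token-wise context instead of a query-biased compression of it.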