Where Matters More Than What: Decoding-aligned KV Cache Compression via Position-aware Pseudo Queries
arXiv cs.CL / 3/13/2026
News · Models & Research
Key Points
- The paper observes that KV cache memory grows with context length and proposes DapQ, which compresses the cache by evicting tokens scored with decoding-aligned, position-aware pseudo queries.
- It finds that positional information matters more than semantic content when constructing pseudo queries, so an observation window can be built that mirrors the decoding process.
- DapQ simulates output tokens so that the observation window is aligned with generation, enabling more accurate token eviction during inference.
- Across multiple benchmarks and LLMs, DapQ shows strong gains under tight memory budgets, including near-lossless performance (99.5%) on NIAH with only a 3% KV cache budget.
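To make the eviction idea concrete, here is a minimal sketch of attention-score-based KV cache eviction. It assumes the paper's core mechanism as summarized above: score cached keys with pseudo queries that carry decoding-step positional information, then keep only the top-scoring fraction of the cache. All names, shapes, and the `evict_kv_cache` helper are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: evict cached (key, value) pairs that receive the
# least attention mass from a set of pseudo queries, keeping a fixed
# fraction ("budget") of the cache. Not the authors' code.
import numpy as np

def evict_kv_cache(keys, values, pseudo_queries, budget=0.03):
    """Keep the `budget` fraction of cached entries with the highest
    average attention weight under the pseudo queries."""
    # keys/values: (seq_len, head_dim); pseudo_queries: (n_pseudo, head_dim)
    scores = pseudo_queries @ keys.T / np.sqrt(keys.shape[-1])
    # Softmax over the sequence dimension, then average over pseudo queries.
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    scores /= scores.sum(axis=-1, keepdims=True)
    importance = scores.mean(axis=0)                    # (seq_len,)
    n_keep = max(1, int(budget * keys.shape[0]))
    keep = np.sort(np.argsort(importance)[-n_keep:])    # preserve token order
    return keys[keep], values[keep]

rng = np.random.default_rng(0)
K = rng.standard_normal((1000, 64))
V = rng.standard_normal((1000, 64))
Q = rng.standard_normal((4, 64))   # stand-in for position-aware pseudo queries
k_small, v_small = evict_kv_cache(K, V, Q, budget=0.03)
print(k_small.shape)  # (30, 64): 3% of a 1000-token cache retained
```

With a 3% budget, a 1000-token cache shrinks to 30 retained entries; the paper's contribution is constructing the pseudo queries so this scoring matches what decoding would actually attend to.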