Where Matters More Than What: Decoding-aligned KV Cache Compression via Position-aware Pseudo Queries
arXiv cs.CL / 3/13/2026
Key Points
- The paper identifies that KV cache memory usage grows with context length and proposes DapQ to compress it by evicting tokens using decoding-aligned, position-aware pseudo queries.
- It shows that positional information matters more than semantic content for constructing the pseudo queries, enabling an observation window that mirrors the decoding process.
- DapQ simulates output tokens to align the observation window with generation, allowing more accurate token eviction during inference.
- Experimental results across multiple benchmarks and LLMs demonstrate strong gains under tight memory budgets, including near-lossless performance (99.5%) on the NIAH (needle-in-a-haystack) benchmark with only a 3% KV cache budget.
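The bullets above describe score-based token eviction steered by a position-aware pseudo query. As a rough illustration of that idea (not DapQ's actual method: the function names `evict_kv` and `rope`, the shapes, and the single-head setup are all assumptions for this sketch), one can rotate a query vector to a future decoding position and keep only the highest-scoring cached tokens:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Minimal rotary positional embedding for one vector, used here to
    make the pseudo query position-aware (illustrative, not DapQ's)."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    angles = pos * freqs
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(angles) - x2 * np.sin(angles),
                           x1 * np.sin(angles) + x2 * np.cos(angles)])

def evict_kv(keys, values, pseudo_query, budget):
    """Keep the `budget` fraction of cached tokens with the highest
    attention weight under the pseudo query; evict the rest."""
    d = keys.shape[-1]
    scores = keys @ pseudo_query / np.sqrt(d)      # (n,) attention logits
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                        # softmax
    k = max(1, int(budget * len(keys)))
    keep = np.sort(np.argsort(weights)[-k:])        # top-k, original order
    return keys[keep], values[keep], keep

# Toy usage: 100 cached tokens, head dim 8, 10% budget.
rng = np.random.default_rng(0)
K = rng.normal(size=(100, 8))
V = rng.normal(size=(100, 8))
# Pseudo query rotated to the *next* decoding position (100), so scoring
# mirrors the upcoming generation step rather than the prompt.
q = rope(rng.normal(size=(8,)), pos=100)
K2, V2, kept = evict_kv(K, V, q, budget=0.10)
```

The point of the sketch is the "where over what" intuition: the query's positional rotation, not its content, is what aligns the scoring with decoding.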