Beyond Token Eviction: Mixed-Dimension Budget Allocation for Efficient KV Cache Compression
arXiv cs.LG / 2026/3/24
Key points
- The paper addresses KV-cache memory growth in transformer inference, which limits long-context deployment. It moves beyond token eviction, whose all-or-nothing choice either keeps every dimension of a token or drops the token entirely.
- It introduces MixedDimKV, which assigns KV cache dimensions to tokens at a finer granularity, and MixedDimKV-H, which additionally uses head-level importance information.
- Experiments on long-context benchmarks indicate MixedDimKV outperforms prior KV cache compression methods that do not incorporate head importance profiling.
- With the same head-level importance signals, MixedDimKV-H consistently beats HeadKV, achieving performance close to full attention on LongBench using only 6.25% of the KV cache.
- In the Needle-in-a-Haystack evaluation, the method preserves 100% accuracy at 50K context length while reducing KV cache usage to as low as 0.26%.
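The core idea of per-token dimension budgeting can be sketched as follows. This is a hypothetical illustration, not the paper's algorithm: the function `allocate_dims`, its proportional-to-importance allocation rule, and the importance scores are all assumptions for demonstration. Where token eviction would assign each token either 0 or all dimensions, a mixed-dimension scheme spreads a global budget across tokens at finer granularity.

```python
import numpy as np

def allocate_dims(importance, total_dim, budget_frac):
    """Hypothetical sketch: split a global KV-dimension budget across tokens
    in proportion to their importance scores, instead of the binary
    keep-all/evict-all choice made by token eviction."""
    n = len(importance)
    total_budget = int(round(budget_frac * n * total_dim))
    w = np.asarray(importance, dtype=float)
    w = w / w.sum()
    # Initial proportional allocation, capped at the full head dimension.
    dims = np.minimum(np.floor(w * total_budget).astype(int), total_dim)
    # Hand any rounding leftover to the most important tokens with headroom.
    leftover = total_budget - dims.sum()
    for i in np.argsort(-w):
        if leftover <= 0:
            break
        take = min(total_dim - dims[i], leftover)
        dims[i] += take
        leftover -= take
    return dims

# Toy example: 4 tokens, 128-dim KV heads, 25% overall cache budget.
scores = [0.5, 0.3, 0.15, 0.05]
dims = allocate_dims(scores, total_dim=128, budget_frac=0.25)
```

In this sketch, important tokens retain most of their KV dimensions while unimportant ones keep only a few, so the total stays within the budget without fully discarding any token. A head-aware variant (in the spirit of MixedDimKV-H) could additionally weight each head's budget by a head-importance profile before running the per-token allocation.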

