Beyond Token Eviction: Mixed-Dimension Budget Allocation for Efficient KV Cache Compression

arXiv cs.LG / March 24, 2026


Key Points

  • The paper addresses KV-cache memory growth in transformer inference, which limits long-context deployment, by moving beyond token eviction’s coarse zero-or-full dimensionality reduction.
  • It introduces MixedDimKV, which assigns KV cache dimensions to tokens at a finer granularity (sketched in code after this list), and MixedDimKV-H, which additionally uses head-level importance information.
  • Experiments on long-context benchmarks indicate MixedDimKV outperforms prior KV cache compression methods that do not incorporate head importance profiling.
  • With the same head-level importance signals, MixedDimKV-H consistently outperforms HeadKV, achieving performance close to full attention on LongBench using only 6.25% of the KV cache.
  • In the Needle-in-a-Haystack evaluation, the method preserves 100% accuracy at 50K context length while reducing KV cache usage to as low as 0.26%.
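
To make the contrast with token eviction concrete, here is a minimal sketch of per-token mixed-dimension allocation. It illustrates the general idea, not the paper's algorithm: the importance scores are placeholders (eviction methods such as H2O or SnapKV derive theirs from attention statistics), the proportional allocation rule and the largest-magnitude dimension selection are assumptions, and all function names are hypothetical.

```python
# Hedged sketch of mixed-dimension KV compression: each token keeps a
# number of cache dimensions proportional to its importance, instead of
# the all-or-nothing choice made by token eviction. NOT the paper's code.
import torch

def allocate_token_dims(importance: torch.Tensor, head_dim: int,
                        budget_ratio: float) -> torch.Tensor:
    """Split a global dimension budget across tokens by importance.

    importance : (seq_len,) non-negative per-token scores (assumed given)
    returns    : (seq_len,) number of dimensions kept per token
    """
    total_budget = budget_ratio * importance.numel() * head_dim
    weights = importance / importance.sum().clamp_min(1e-8)
    # floor() leaves a small remainder of the budget unused; a real
    # packer would redistribute it rather than drop it.
    return (weights * total_budget).floor().long().clamp_(0, head_dim)

def compress_kv(keys: torch.Tensor, values: torch.Tensor,
                dims: torch.Tensor):
    """Keep, for each token, only its dims[t] largest-|k| dimensions.

    keys, values : (seq_len, head_dim)
    Returns (indices, key_slice, value_slice) per token; a real cache
    would pack these into a ragged or bitmask layout, not a Python list.
    """
    out = []
    for t in range(keys.size(0)):
        d = int(dims[t])
        idx = (keys[t].abs().topk(d).indices if d > 0
               else keys.new_empty(0, dtype=torch.long))
        out.append((idx, keys[t, idx], values[t, idx]))
    return out

# Toy usage: 8 tokens, head_dim 16, keeping 25% of the cache overall.
torch.manual_seed(0)
K, V = torch.randn(8, 16), torch.randn(8, 16)
scores = torch.rand(8)                       # stand-in importance scores
dims = allocate_token_dims(scores, head_dim=16, budget_ratio=0.25)
cache = compress_kv(K, V, dims)
print(dims.tolist())  # low-importance tokens get few (possibly 0) dims
```

Token eviction falls out as the special case where dims[t] is forced to either 0 or head_dim; the point of the mixed-dimension view is that everything in between is also available.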

Abstract

Key-value (KV) caching is widely used to accelerate transformer inference, but its memory cost grows linearly with input length, limiting long-context deployment. Existing token eviction methods reduce memory by discarding less important tokens, which can be viewed as a coarse form of dimensionality reduction that assigns each token either zero or full dimension. We propose MixedDimKV, a mixed-dimension KV cache compression method that allocates dimensions to tokens at a more granular level, and MixedDimKV-H, which further integrates head-level importance information. Experiments on long-context benchmarks show that MixedDimKV outperforms prior KV cache compression methods that do not rely on head-level importance profiling. When equipped with the same head-level importance information, MixedDimKV-H consistently outperforms HeadKV. Notably, our approach achieves comparable performance to full attention on LongBench with only 6.25% of the KV cache. Furthermore, in the Needle-in-a-Haystack test, our solution maintains 100% accuracy at a 50K context length while using as little as 0.26% of the cache.
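
As a companion sketch, the head-aware variant described in the abstract can be pictured as first splitting one global cache ratio across attention heads by importance, then running the per-token allocation above inside each head with its own ratio. This is a hedged illustration, not MixedDimKV-H itself: the head scores below are random placeholders, whereas HeadKV-style methods estimate them offline (e.g., via retrieval-head profiling), and the proportional split is an assumption.

```python
# Hedged sketch: important heads receive a larger share of a single
# global KV-cache budget. Head scores and the split rule are assumed.
import torch

def allocate_head_budgets(head_scores: torch.Tensor,
                          total_ratio: float) -> torch.Tensor:
    """Split one global cache ratio across heads by importance.

    head_scores : (num_heads,) non-negative importance per head
    returns     : (num_heads,) per-head ratios whose mean equals
                  total_ratio (exactly, unless a head saturates at 1.0)
    """
    weights = head_scores / head_scores.sum().clamp_min(1e-8)
    ratios = weights * total_ratio * head_scores.numel()
    return ratios.clamp(max=1.0)  # no head can keep >100% of its cache

# Toy usage: 4 heads sharing the 6.25% overall budget quoted above;
# head 0 plays the role of a highly important "retrieval" head.
head_scores = torch.tensor([0.70, 0.10, 0.15, 0.05])
per_head = allocate_head_budgets(head_scores, total_ratio=0.0625)
print(per_head.tolist())  # ~[0.175, 0.025, 0.0375, 0.0125]
# Each head then runs the per-token allocation from the first sketch:
#   dims_h = allocate_token_dims(token_scores_h, head_dim, per_head[h])
```

Under this two-level scheme the overall cache stays at the target ratio while the important heads, and the important tokens within each head, claim most of the retained dimensions.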