KVSculpt: KV Cache Compression as Distillation

arXiv cs.LG / 2026-03-31

Key Points

  • KVSculpt targets long-context LLM inference by compressing KV caches beyond quantization/low-rank methods, focusing instead on reducing the effective sequence dimension.
  • The method replaces eviction/merging of original KV entries with an optimization of a smaller set of unconstrained KV pairs in continuous embedding space to preserve layer-level attention behavior.
  • Keys are optimized using L-BFGS while values are computed analytically via least-squares, alternating optimization steps to make the procedure practical.
  • Adaptive budget allocation uses a pilot compression run to reallocate the compression budget across layers and KV heads based on their per-component difficulty.
  • On Qwen2.5-1.5B-Instruct at 2048-token contexts, KVSculpt cuts KL divergence by 3.5–4.1x versus Select+Fit across compression ratios, with an additional 1.3x improvement from adaptive allocation without extra inference cost.
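The alternating scheme in the bullets above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's implementation: `compress_kv` and its initialization-from-a-random-subset strategy are assumptions for the sketch. With the compressed keys fixed, the attention weights are fixed, so the compressed values minimizing the attention-output error have a closed-form least-squares solution; the keys are then refined with L-BFGS, and the two steps alternate.

```python
import torch

def compress_kv(Q, K, V, m, rounds=5, lbfgs_steps=10):
    """Hypothetical sketch of alternating KV compression.

    Q: (nq, d) probe queries; K, V: (n, d) original cache entries;
    m: compressed cache length (m < n).
    """
    d = K.shape[1]
    scale = d ** -0.5
    # Target: the layer's attention output with the full cache.
    Y = torch.softmax(Q @ K.T * scale, dim=-1) @ V        # (nq, d)
    # Initialize compressed keys from a random subset of the originals
    # (an assumption of this sketch, not the paper's initialization).
    idx = torch.randperm(K.shape[0])[:m]
    K_c = K[idx].clone().requires_grad_(True)
    V_c = V[idx].clone()
    for _ in range(rounds):
        # Value step: with K_c fixed, attention weights A are fixed, so
        # V_c = argmin ||A V_c - Y||^2 is an exact least-squares solve.
        with torch.no_grad():
            A = torch.softmax(Q @ K_c.T * scale, dim=-1)  # (nq, m)
            V_c = torch.linalg.lstsq(A, Y).solution       # (m, d)
        # Key step: L-BFGS on the attention-output MSE with V_c fixed.
        opt = torch.optim.LBFGS([K_c], max_iter=lbfgs_steps)
        def closure():
            opt.zero_grad()
            out = torch.softmax(Q @ K_c.T * scale, dim=-1) @ V_c
            loss = ((out - Y) ** 2).mean()
            loss.backward()
            return loss
        opt.step(closure)
    return K_c.detach(), V_c
```

Because the compressed pairs are unconstrained points in embedding space, nothing ties them to specific original cache entries, which is what distinguishes this from eviction or merging.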

Abstract

KV cache compression is critical for efficient long-context LLM inference. Approaches that reduce the per-pair footprint -- quantization and low-rank decomposition -- are orthogonal to those that reduce the sequence length of the cache. Along the sequence-length dimension, existing methods range from pure eviction -- selecting which KV pairs to keep -- to merging, which combines similar pairs into fewer ones. Both remain anchored to the original cache entries. We propose KVSculpt, which moves to the other end of this spectrum: instead of selecting or combining original pairs, we optimize a smaller set of unconstrained KV pairs in continuous embedding space to preserve each layer's attention behavior. Keys are optimized via L-BFGS and values are solved in closed form via least squares, alternating every few steps. On top of this, we introduce adaptive budget allocation, which uses a cheap pilot compression run to redistribute the compression budget across layers and KV heads based on per-component difficulty. On Qwen2.5-1.5B-Instruct with 2048-token contexts, KVSculpt reduces KL divergence by 3.5-4.1x compared to Select+Fit -- attention-score eviction with least-squares value fitting -- across compression ratios r in {0.3, 0.5, 0.7}. Adaptive allocation provides an additional 1.3x KL reduction at no extra inference cost. Analysis reveals that compression difficulty is highly non-uniform: per-layer pilot MSE varies by up to 100x across layers, and the two KV heads within a single layer can differ by up to 467x -- demonstrating that fine-grained budget allocation is essential.
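The adaptive budget allocation described above can be sketched as follows. This is a hedged illustration under assumptions: the proportional-to-MSE weighting, the `min_tokens` floor, and the `allocate_budget` helper are inventions of the sketch, while the core idea (a cheap pilot run measures per-component difficulty, and the global budget is redistributed across layers and KV heads accordingly) follows the abstract.

```python
import numpy as np

def allocate_budget(pilot_mse, total_budget, min_tokens=4):
    """Hypothetical sketch: redistribute a global token budget across
    (layer, head) components in proportion to each component's pilot-run
    compression error, with a small floor per component."""
    mse = np.asarray(pilot_mse, dtype=float)
    # Harder components (larger pilot MSE) receive more of the budget.
    weights = mse / mse.sum()
    raw = weights * (total_budget - min_tokens * len(mse))
    budgets = min_tokens + np.floor(raw).astype(int)
    # Hand out any rounding leftover to the hardest components first.
    leftover = int(total_budget - budgets.sum())
    for i in np.argsort(-mse)[:leftover]:
        budgets[i] += 1
    return budgets
```

Given the reported spread in difficulty (up to 100x across layers and 467x between the two KV heads of a single layer), even a simple proportional rule like this concentrates budget where a uniform split would waste it, and the pilot run adds no cost at inference time.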
