KVSculpt: KV Cache Compression as Distillation
arXiv cs.LG · March 31, 2026
Key points
- KVSculpt targets long-context LLM inference by compressing KV caches beyond quantization/low-rank methods, focusing instead on reducing the effective sequence dimension.
- The method replaces eviction/merging of original KV entries with an optimization of a smaller set of unconstrained KV pairs in continuous embedding space to preserve layer-level attention behavior.
- Keys are optimized with L-BFGS while values are solved analytically via least squares; alternating between the two steps keeps the procedure practical.
- Adaptive budget allocation uses a pilot compression run to reallocate the compression budget across layers and KV heads based on their per-component difficulty.
- On Qwen2.5-1.5B-Instruct at 2048-token contexts, KVSculpt cuts KL divergence by 3.5–4.1x versus Select+Fit across compression ratios, with an additional 1.3x improvement from adaptive allocation without extra inference cost.
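The key/value alternation described above can be illustrated with a toy sketch. This is not the paper's implementation: the shapes, the variable names (`Kc`, `Vc`), the use of SciPy's L-BFGS-B with finite-difference gradients, and the single-head setup are all our assumptions, chosen only to show the structure of the alternating optimization, where compressed keys are free continuous parameters and compressed values have a closed-form least-squares solution.

```python
# Toy sketch of alternating key/value optimization for attention-preserving
# KV compression. All names and sizes are illustrative assumptions, not the
# paper's code: 16 cached tokens are compressed to 4 unconstrained slots so
# that the compressed attention output matches the original one.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

n, m, d = 16, 4, 8                       # cached tokens, compressed slots, head dim
Q = rng.standard_normal((n, d))          # queries used to probe attention behavior
K = rng.standard_normal((n, d))          # original cached keys
V = rng.standard_normal((n, d))          # original cached values
O = softmax(Q @ K.T / np.sqrt(d)) @ V    # target: original attention output

Kc = rng.standard_normal((m, d)) * 0.1   # free compressed keys (continuous params)
Vc = np.zeros((m, d))                    # compressed values, filled in analytically

def attn_weights(Kc):
    return softmax(Q @ Kc.T / np.sqrt(d))

for _ in range(5):
    # Value step: with keys fixed, the compressed values minimizing
    # ||A @ Vc - O||^2 are a closed-form least-squares solution.
    A = attn_weights(Kc)
    Vc, *_ = np.linalg.lstsq(A, O, rcond=None)
    # Key step: with values fixed, refine the keys with L-BFGS
    # (finite-difference gradients here, for simplicity of the sketch).
    def loss(flat):
        A = attn_weights(flat.reshape(m, d))
        return ((A @ Vc - O) ** 2).sum()
    res = minimize(loss, Kc.ravel(), method="L-BFGS-B")
    Kc = res.x.reshape(m, d)

err = np.linalg.norm(attn_weights(Kc) @ Vc - O) / np.linalg.norm(O)
print(f"relative attention-output error after compressing 16 -> 4 slots: {err:.3f}")
```

Because the compressed entries are unconstrained points in embedding space rather than a subset or merge of the original entries, the optimizer can place them anywhere that best reproduces the layer's attention output, which is the core difference from eviction- or merging-based baselines.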
