eOptShrinkQ: Near-Lossless KV Cache Compression Through Optimal Spectral Denoising and Quantization

arXiv cs.LG / 5/6/2026


Key Points

  • The paper argues that transformer KV caches naturally split into a low-rank “shared context” part and a full-rank per-token residual, modeled well by a spiked random matrix framework.
  • It introduces eOptShrinkQ, a two-stage KV compression method that first applies optimal singular value shrinkage to capture the shared structure, then quantizes the residual using TurboQuant.
  • Spectral denoising is used to restore isotropy for scalar quantization, removing the need for outlier handling and inner-product bias correction while reallocating bits to improve reconstruction quality.
  • Random matrix theory is used to provide guarantees including automatic rank selection (via the BBP phase transition), near-zero inner-product bias on the residual, and delocalized coordinates that enable near-optimal quantization distortion.
  • Experiments on Llama-3.1-8B and Ministral-8B show eOptShrinkQ achieves better quality–bit tradeoffs than TurboQuant (e.g., ~2.2 bits vs 3.0 bits on LongBench) and can match or beat uncompressed FP16 in multi-needle retrieval, suggesting a regularization benefit for retrieval-heavy tasks.
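The two-stage pipeline described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction under stated assumptions, not the paper's implementation: a hard singular-value threshold stands in for eOptShrink's optimal shrinker and BBP-based rank selection, and a simple per-row uniform scalar quantizer stands in for TurboQuant. All function names are hypothetical.

```python
import numpy as np

def compress_kv(X, bits=4):
    """Two-stage sketch: low-rank extraction, then residual quantization.

    Stage 1: keep singular values above a rough noise-bulk threshold
    (a crude stand-in for eOptShrink, which instead applies an optimal
    shrinker and selects rank via the BBP phase transition).
    Stage 2: quantize the per-token residual with a per-row uniform
    scalar quantizer (a stand-in for TurboQuant).
    """
    n, d = X.shape
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Rough noise-level estimate from the median singular value;
    # threshold near the bulk edge sqrt(n) + sqrt(d) (illustrative only).
    sigma = np.median(s) / np.sqrt(max(n, d))
    r = int(np.sum(s > sigma * (np.sqrt(n) + np.sqrt(d))))
    low_rank = (U[:, :r] * s[:r]) @ Vt[:r]
    residual = X - low_rank

    # Per-row uniform scalar quantization of the (near-isotropic) residual.
    half_range = 2 ** (bits - 1) - 1
    scale = np.abs(residual).max(axis=1, keepdims=True) + 1e-12
    q = np.round(residual / scale * half_range).astype(np.int8)
    return low_rank, q, scale

def decompress_kv(low_rank, q, scale, bits=4):
    half_range = 2 ** (bits - 1) - 1
    return low_rank + q.astype(np.float32) / half_range * scale
```

On a synthetic spiked matrix (a few strong rank-one signal directions plus isotropic noise), the low-rank stage absorbs the shared structure, so quantization error is confined to the small residual, which is the effect the paper exploits to free bits otherwise spent on outlier handling.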

Abstract

We show that the key-value (KV) cache in transformer attention heads admits a natural decomposition into a low-rank "shared context" component and a full-rank "per-token" residual, well described by the spiked random matrix model. This observation leads to eOptShrinkQ, a two-stage compression pipeline: optimal singular value shrinkage (eOptShrink) automatically extracts the shared structure, and the residual -- which satisfies the "thin shell property" with delocalized coordinates -- is quantized by TurboQuant (Zandieh et al., 2025), a recently proposed per-vector scalar quantizer with near-optimal distortion guarantees. By restoring the isotropy that scalar quantization assumes, spectral denoising eliminates the need for both outlier handling and dedicated inner product bias correction, freeing those bits for improved reconstruction. The theoretical grounding in random matrix theory provides three guarantees: automatic rank selection via the BBP phase transition, provably near-zero inner product bias on the residual, and coordinate delocalization ensuring near-optimal quantization distortion. Experimentally, we validate eOptShrinkQ on Llama-3.1-8B and Ministral-8B across three levels: per-head MSE and inner product fidelity, where eOptShrinkQ saves nearly one bit per entry over TurboQuant at equivalent quality; end-to-end on LongBench (16 tasks), where eOptShrinkQ at ~2.2 bits per entry outperforms TurboQuant at 3.0 bits; and multi-needle retrieval, where eOptShrinkQ at 2.2 bits closely matches or exceeds uncompressed FP16, suggesting that spectral denoising can act as a beneficial regularizer for retrieval-intensive tasks.
