HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization

arXiv cs.LG / 5/6/2026


Key Points

  • The paper argues that KV-cache quantization should be evaluated and corrected in model-visible coordinates (score/logit space), rather than judged only by storage-space reconstruction error such as raw key MSE.
  • It introduces HeadQ, a key-side quantization method that stores a low-rank residual side code in a calibration-learned query basis and applies it as an additive correction to attention logits (a sketch of this correction follows the list).
  • For values, it proposes a distortion surrogate based on fixed-attention readout: an A^2-weighted token-distortion metric that weights each token's error by the squared attention mass it receives (see the second sketch, after the abstract).
  • Across six tested models, score/Fisher-space error predicts attention KL divergence more accurately than raw key MSE, and multiple counterexamples and controls falsify storage-MSE-based alternatives.
  • In K-only WikiText-103 decoding (with dense values), HeadQ removes about 84–94% of the excess perplexity on the strongest 2-bit quantization rows, and combining HeadQ with an A^2 value policy improves all six models in an auxiliary full-KV 2-bit composition.
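
A minimal sketch of that key-side correction, under one reading of the summary above: the calibration-learned query basis is approximated here by an SVD of sample queries, and all names (query_basis_from_calibration, make_side_code, corrected_logits) are illustrative, not the paper's API.

```python
import numpy as np

def query_basis_from_calibration(Q_calib, r):
    """Top-r right singular vectors of calibration queries: a (d, r) orthonormal basis."""
    _, _, vt = np.linalg.svd(Q_calib, full_matrices=False)
    return vt[:r].T

def make_side_code(K, K_quant, B):
    """Low-rank side code: the key quantization residual projected onto the query basis."""
    residual = K - K_quant           # (n_keys, d) error that storage quantization discards
    return residual @ B              # (n_keys, r), stored alongside the quantized keys

def corrected_logits(Q, K_quant, B, side_code):
    """Attention logits from the quantized keys plus the additive score-space correction."""
    base = Q @ K_quant.T                  # logits read from the quantized cache
    correction = (Q @ B) @ side_code.T    # ~= Q @ residual.T when Q lies near span(B)
    return base + correction
```

The correction is exact whenever the queries lie in the span of the basis B, since then Q @ residual.T equals (Q @ B) @ (residual @ B).T; outside that span it recovers the score error only up to the dropped query directions.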

Abstract

KV-cache quantizers usually optimize storage-space reconstruction, even though attention reads keys through logits and values through attention-weighted readout. We argue that persistent cache error should be measured in model-visible coordinates. For keys, the visible object is score error modulo constant shifts; this yields HeadQ, a key-side method that stores a low-rank residual side code in a calibration-learned query basis and applies it as an additive logit correction. For values, fixed-attention readout gives an A^2-weighted token-distortion surrogate. Across six models, Fisher/score-space error predicts attention KL far better than raw key MSE; same-budget counterexamples, null-space interventions, query-PCA controls, and wrong-sign HeadQ falsify storage-MSE alternatives. Matched Pythia checkpoints localize the main anomaly to a small-model low-entropy route-flip boundary. In K-only WikiText-103 decode experiments with dense values, HeadQ removes roughly 84–94% of the excess perplexity on the strongest 2-bit rows; in an auxiliary full-KV 2-bit composition, HeadQ plus an A^2 value policy improves all six models.
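
For the value side, the following sketch computes one plausible version of the fixed-attention A^2-weighted surrogate the abstract describes; the exact weighting and the function name a2_weighted_value_distortion are assumptions, not the paper's implementation.

```python
import numpy as np

def a2_weighted_value_distortion(A, V, V_hat):
    """Fixed-attention surrogate for value-cache quantization error.

    With the attention weights A held fixed, the readout error is A @ (V - V_hat).
    Weighting each token's squared distortion by sum_q A[q, t]**2 scores tokens by
    how strongly the readout actually uses them.

    A     : (n_queries, n_tokens) attention weights
    V     : (n_tokens, d_head) original values
    V_hat : (n_tokens, d_head) quantized values
    """
    token_sq_err = np.sum((V - V_hat) ** 2, axis=1)   # per-token value distortion
    a2_weight = np.sum(A ** 2, axis=0)                # squared attention mass on each token
    return float(np.dot(a2_weight, token_sq_err))
```

A per-token score of this kind is the signal an "A^2 value policy" could act on, for example by spending the value-cache bit budget where the attention-squared weight is largest.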