I found why KV cache INT4 breaks on some models (Qwen2-7B: ΔPPL +238) and built a 4-line fix: no training, no calibration, 12 models tested up to 40B

Reddit r/LocalLLaMA / 4/16/2026


Key Points

  • The article explains that KV-cache INT4 quantization can succeed on some transformer models but catastrophically fail on others, as shown by Qwen2-7B (ΔPPL +238) versus Falcon-40B (ΔPPL +0.08).
  • It identifies two interacting root causes: token-wise KV norm variation in Pre-LN models and activation outlier channels that force quantization scales to collapse precision for all other channels.
  • The proposed drop-in fix “nsep+pchan” decomposes each KV vector into an FP16 norm scalar and a unit-direction vector, then applies per-channel scaling and quantizes only the direction with clamping to INT4 range.
  • Results across 12 models up to 40B show drastic recovery for previously broken cases, including 744× improvement on Qwen2-7B and 82× on Pythia-6.9B, with the fix reported to never meaningfully hurt performance.

I've been playing with KV cache INT4 quantization and noticed something weird: it works perfectly on some models and completely destroys others.

Examples:

  • Falcon-40B: ΔPPL +0.08 ✅ (basically free compression)
  • OPT-13B: ΔPPL +0.28 ✅
  • Qwen2-7B: ΔPPL +238 ❌ (output becomes incoherent garbage)
  • Pythia-6.9B: ΔPPL +22 ❌
  • Pythia-410M: ΔPPL +77 ❌

Same quantization method. Why does it break on some and not others?

Root cause: two independent problems

  1. Token-wise norm variation — KV vector norms fluctuate 2-5x across tokens in Pre-LN models. Per-row absmax gives inconsistent quantization precision across the sequence.
  2. Activation outlier channels — Certain channels have values 8-100x larger than average (Qwen2-7B Layer 0: 8 channels at absmax 167). They hijack the quantization scale and kill precision on all other channels.
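To see why one outlier channel is so destructive, here's a small illustrative sketch (synthetic data, not the author's code): symmetric per-row absmax INT4 on a row where one channel sits at 167 while the rest are ordinary. The outlier sets the scale, so every other channel rounds to zero.

```python
import torch

torch.manual_seed(0)
# Hypothetical activation row: 63 ordinary channels ~N(0,1), one outlier at 167
x = torch.randn(64)
x[0] = 167.0

def int4_absmax(v):
    # Symmetric INT4: integer levels -7..7, scale set by the row's absmax
    scale = v.abs().amax() / 7
    return (v / scale).round().clamp(-7, 7) * scale

# Mean abs error on the ordinary channels, with and without the outlier present
err_with = (int4_absmax(x)[1:] - x[1:]).abs().mean()
err_without = (int4_absmax(x[1:]) - x[1:]).abs().mean()
print(err_with, err_without)
```

With the outlier, the quantization step is 167/7 ≈ 23.9, so every ordinary channel snaps to zero and the error is roughly the channels' own magnitude; drop the outlier and the step shrinks by two orders of magnitude.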

Fixing only one doesn't help much:

  • Norm separation only → +57.5 (still bad)
  • Per-channel only → +97.8 (still bad)
  • Both combined → +0.32 (744x improvement)

The fix (nsep+pchan)

Before quantizing, decompose each KV vector into norm (FP16 scalar) and direction (unit vector, quantized). Then use per-channel scaling. 4 lines of PyTorch:

```python
norm = x.norm(dim=-1, keepdim=True)
direction = x / norm
scale = direction.abs().amax(dim=-1, keepdim=True) / 7
direction_q = (direction / scale).round().clamp(-7, 7) * scale
```

No training. No calibration. No model-specific tuning. Drop-in preprocessing step — works in front of whatever quantization you're already using.

Results across 12 models (124M to 40B)

| Model | naive INT4 (ΔPPL) | nsep+pchan (ΔPPL) | Improvement |
|---|---|---|---|
| Qwen2-7B | +238 | +0.32 | 744× |
| Pythia-6.9B | +22 | +0.27 | 82× |
| Pythia-12B | +27 | +1.82 | 15× |
| Pythia-410M | +78 | +12.62 | |
| Falcon-40B | +0.08 | +0.04 | |
| OPT-13B | +0.28 | +0.35 | |

Full table with all 12 models in the paper.

Key point: it never hurts. Worst-case degradation is +0.24 ΔPPL (OPT-125M). Models that already work fine under naive INT4 see no meaningful change.

Long context gets even crazier

At 4096 tokens on Qwen2-7B:

  • naive INT4: ΔPPL +8293
  • nsep+pchan: ΔPPL +0.19
  • That's a 44,000× improvement

The error accumulates in attention computation as the KV cache grows. Norm separation prevents this compounding.

Bonus: INT3 > INT4 on Qwen2-7B

This one surprised me. On Qwen2-7B, INT3 (ΔPPL +6.6) is actually 36x better than INT4 (ΔPPL +238). The reason: INT4 maps mid-range values to noisy non-zeros because of the outlier-dominated scale. INT3's coarser grid maps them to clean zeros instead. In attention, clean zeros beat noisy non-zeros. Per-channel quantization fixes this for both bit widths.
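The mapping behind this is easy to check with a toy example (illustrative numbers, assuming the outlier-dominated absmax of 167 reported for Qwen2-7B layer 0): under INT4 a mid-range value lands on the first nonzero level, roughly doubling it, while INT3's coarser grid sends it to zero.

```python
import torch

def quantize(v, bits, absmax):
    # Symmetric quantization with an outlier-dominated scale
    qmax = 2 ** (bits - 1) - 1          # 7 for INT4, 3 for INT3
    scale = absmax / qmax
    return (v / scale).round().clamp(-qmax, qmax) * scale

absmax = 167.0                          # outlier channel sets the scale
v = torch.tensor(14.0)                  # hypothetical mid-range channel value

print(quantize(v, 4, absmax))           # noisy non-zero: ~23.9 (value nearly doubled)
print(quantize(v, 3, absmax))           # clean zero
```

The INT4 step is 167/7 ≈ 23.9, so 14.0 rounds up to 23.9; the INT3 step is 167/3 ≈ 55.7, so the same value rounds to 0, which is the "clean zero" the post is describing.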

PyTorch .norm() trap

While building this, I got bitten by a PyTorch API gotcha: x.norm(-1, keepdim=True) computes L_{-1} norm, NOT L2 norm. The first arg is p (norm order), not dim. Correct: x.norm(dim=-1, keepdim=True). Was invisible on CPU, exploded 3000x on CUDA. Embarrassing but posting in case it saves someone else the debugging.
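The gotcha is easy to reproduce on a vector where the two readings differ (toy tensor of my choosing):

```python
import torch

x = torch.tensor([[3.0, 4.0]])

# Trap: the first positional argument of Tensor.norm is p (the norm order), not dim
wrong = x.norm(-1)         # L_{-1} norm over ALL elements: (1/3 + 1/4) ** -1 ≈ 1.714
right = x.norm(dim=-1)     # L2 norm over the last dim: tensor([5.])
print(wrong, right)
```

Both calls run without error and return plausible-looking tensors, which is why the bug can sit invisibly until the numbers diverge downstream.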

Paper: https://doi.org/10.5281/zenodo.19590278

Code + all results: https://github.com/metaSATOKEN/norm-separated-quantization

Happy to answer questions. All experiments are reproducible.

submitted by /u/Afraid_Project_8666