Do not use mixed KV cache quantization

Reddit r/LocalLLaMA / 3/29/2026

💬 Opinion · Tools & Practical Usage · Models & Research

Key Points

  • The post argues against “mixed quantization” for the attention KV cache, saying it is incorrect and leads to misleading results even if it seems to save memory while preserving accuracy.
  • It references a longer blog explanation and presents benchmark evidence comparing different KV cache quantization settings (e.g., mixing f16/q8_0 vs using q8_0 for both K and V).
  • The benchmark table shows a large throughput gap: the mixed f16/q8_0 configuration processes prompts at roughly a third of the speed of uniform q8_0 (≈334 vs ≈953 t/s), so mixing K and V quantization formats materially affects performance rather than behaving as a neutral memory/accuracy tradeoff.
  • The author concludes users should avoid mixed KV cache quantization and instead use consistent quantization settings for K and V to get reliable behavior.

I've seen a few people in the comments on here and the other AI subs suggest mixing quantization types for the KV cache to retain higher accuracy while still saving memory. I ran it that way for a while until I realized how wrong it is.
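The memory argument behind mixing is easy to quantify. Here is a rough sketch of KV cache sizes under the different type_k/type_v combinations; the layer/head counts are illustrative assumptions for a 9B-class model, not the real Qwen3 config, and the q8_0 size uses llama.cpp's block layout (32 int8 values plus one fp16 scale per block):

```python
# Estimate KV cache memory under different K/V quantization types.
# Model shape numbers below are illustrative assumptions, not Qwen3's real config.

def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_k, bytes_v):
    """Total bytes for the K and V caches at the given per-element sizes."""
    per_token = n_kv_heads * head_dim  # elements per layer, per token, per cache
    return n_ctx * n_layers * per_token * (bytes_k + bytes_v)

F16 = 2.0        # 2 bytes per element
Q8_0 = 34 / 32   # llama.cpp q8_0 block: 32 int8 quants + one fp16 scale = 34 B / 32 elems

CFG = dict(n_ctx=32768, n_layers=36, n_kv_heads=8, head_dim=128)

for name, bk, bv in [("f16/f16", F16, F16),
                     ("f16/q8_0", F16, Q8_0),
                     ("q8_0/q8_0", Q8_0, Q8_0)]:
    size = kv_cache_bytes(bytes_k=bk, bytes_v=bv, **CFG)
    print(f"{name:10s} {size / 2**30:.2f} GiB")
```

Under these assumed dimensions, f16/f16 needs 4.50 GiB, the mixed f16/q8_0 setup 3.45 GiB, and uniform q8_0 2.39 GiB, so the mixed variant gives up most of the savings on top of the throughput problem shown below.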

I wrote a longer blogpost about it, but TL;DR is this benchmark run:

| model | size | params | backend | ngl | n_batch | type_k | type_v | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35 9B Q6_K | 6.84 GiB | 8.95 B | Vulkan | 99 | 1024 | f16 | q8_0 | 1 | pp5000 | 334.27 ± 1.42 |
| qwen35 9B Q6_K | 6.84 GiB | 8.95 B | Vulkan | 99 | 1024 | f16 | q8_0 | 1 | tg128 | 53.53 ± 0.23 |
| qwen35 9B Q6_K | 6.84 GiB | 8.95 B | Vulkan | 99 | 1024 | q8_0 | q8_0 | 1 | pp5000 | 952.79 ± 0.46 |
| qwen35 9B Q6_K | 6.84 GiB | 8.95 B | Vulkan | 99 | 1024 | q8_0 | q8_0 | 1 | tg128 | 63.37 ± 0.06 |
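The columns above match `llama-bench` output from llama.cpp, so a run along these lines should reproduce both configurations in one sweep (the model path is a placeholder; `-ctk`/`-ctv` accept comma-separated lists, and the two `-ctk` values cross with the single `-ctv` value to give exactly the f16/q8_0 and q8_0/q8_0 rows):

```shell
# Compare mixed (f16 K / q8_0 V) against uniform q8_0 KV cache quantization.
# Model filename is a placeholder; backend/flags taken from the table above.
llama-bench -m qwen3-9b-Q6_K.gguf \
  -ngl 99 -b 1024 -fa 1 \
  -p 5000 -n 128 \
  -ctk f16,q8_0 -ctv q8_0
```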
submitted by /u/L3tum