Do not use mixed KV cache quantization

Reddit r/LocalLLaMA / 3/29/2026

💬 Opinion · Tools & Practical Usage · Models & Research

Key Points

  • The post argues against “mixed quantization” for the attention KV cache, saying it is incorrect and leads to misleading results even if it seems to save memory while preserving accuracy.
  • It references a longer blog explanation and presents benchmark evidence comparing different KV cache quantization settings (e.g., mixing f16/q8_0 vs using q8_0 for both K and V).
  • The benchmark table shows a large throughput gap: the mixed f16/q8_0 configuration processes prompts at roughly a third of the speed of uniform q8_0 (≈334 vs ≈953 t/s), so mixing K and V quantization formats materially affects performance rather than behaving as a neutral memory/accuracy tradeoff.
  • The author concludes users should avoid mixed KV cache quantization and instead use consistent quantization settings for K and V to get reliable behavior.

I've seen a few people in the comments on here and the other AI subs suggest mixing quantization types for the KV cache to retain higher accuracy while still saving memory. I ran it that way for a while until I realized how wrong it is.
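The memory argument behind mixing is easy to quantify. Here is a rough sketch of KV cache sizes under the different type_k/type_v combinations; the layer/head counts are illustrative assumptions for a 9B-class model, not the real Qwen3 config, and the q8_0 size uses llama.cpp's block layout (32 int8 values plus one fp16 scale per block):

```python
# Estimate KV cache memory under different K/V quantization types.
# Model shape numbers below are illustrative assumptions, not Qwen3's real config.

def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_k, bytes_v):
    """Total bytes for the K and V caches at the given per-element sizes."""
    per_token = n_kv_heads * head_dim  # elements per layer, per token, per cache
    return n_ctx * n_layers * per_token * (bytes_k + bytes_v)

F16 = 2.0        # 2 bytes per element
Q8_0 = 34 / 32   # llama.cpp q8_0 block: 32 int8 quants + one fp16 scale = 34 B / 32 elems

CFG = dict(n_ctx=32768, n_layers=36, n_kv_heads=8, head_dim=128)

for name, bk, bv in [("f16/f16", F16, F16),
                     ("f16/q8_0", F16, Q8_0),
                     ("q8_0/q8_0", Q8_0, Q8_0)]:
    size = kv_cache_bytes(bytes_k=bk, bytes_v=bv, **CFG)
    print(f"{name:10s} {size / 2**30:.2f} GiB")
```

Under these assumed dimensions, f16/f16 needs 4.50 GiB, the mixed f16/q8_0 setup 3.45 GiB, and uniform q8_0 2.39 GiB, so the mixed variant gives up most of the savings on top of the throughput problem shown below.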

I wrote a longer blogpost about it, but TL;DR is this benchmark run:

| model | size | params | backend | ngl | n_batch | type_k | type_v | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35 9B Q6_K | 6.84 GiB | 8.95 B | Vulkan | 99 | 1024 | f16 | q8_0 | 1 | pp5000 | 334.27 ± 1.42 |
| qwen35 9B Q6_K | 6.84 GiB | 8.95 B | Vulkan | 99 | 1024 | f16 | q8_0 | 1 | tg128 | 53.53 ± 0.23 |
| qwen35 9B Q6_K | 6.84 GiB | 8.95 B | Vulkan | 99 | 1024 | q8_0 | q8_0 | 1 | pp5000 | 952.79 ± 0.46 |
| qwen35 9B Q6_K | 6.84 GiB | 8.95 B | Vulkan | 99 | 1024 | q8_0 | q8_0 | 1 | tg128 | 63.37 ± 0.06 |
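The columns above match `llama-bench` output from llama.cpp, so a run along these lines should reproduce both configurations in one sweep (the model path is a placeholder; `-ctk`/`-ctv` accept comma-separated lists, and the two `-ctk` values cross with the single `-ctv` value to give exactly the f16/q8_0 and q8_0/q8_0 rows):

```shell
# Compare mixed (f16 K / q8_0 V) against uniform q8_0 KV cache quantization.
# Model filename is a placeholder; backend/flags taken from the table above.
llama-bench -m qwen3-9b-Q6_K.gguf \
  -ngl 99 -b 1024 -fa 1 \
  -p 5000 -n 128 \
  -ctk f16,q8_0 -ctv q8_0
```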
submitted by /u/L3tum