KLD measurements of 8 different llama.cpp KV cache quantizations over several 8-12B models

Reddit r/LocalLLaMA / 3/24/2026


Key Points

  • The post reports self-run KLD (relative entropy) measurements comparing llama.cpp KV cache quantization settings across eight 8B–12B models (e.g., Qwen3.5 9B, Qwen3-VL 8B, Gemma 3 12B, Ministral 3 8B, Irix 12B).
  • The author generates base logits using an f16 KV cache and then evaluates how quantizing the KV cache (tested via llama.cpp) changes the resulting output distributions using KLD.
  • Due to limited GPU VRAM, the author uses already-quantized models (IQ4_XS) to produce logits, arguing that KLD still meaningfully reflects divergence caused by KV quantization when the model weights are held fixed.
  • The methodology uses llama-perplexity-style KLD computation over parts of the context window (mostly validated on wikitext-2 and a stitched 200k-token RP test, with results largely matching wikitext-2).
  • Some quantization variants (e.g., iq4_nl) could not be run on CUDA, so they are excluded from the comparison.

A couple of weeks ago I was wondering about the impact of KV quantization, so I tried looking for any PPL or KLD measurements but didn't find anything extensive. I ran some of my own, and these are the results. Models included: Qwen3.5 9B, Qwen3 VL 8B, Gemma 3 12B, Ministral 3 8B, Irix 12B (Mistral Nemo).

Disclaimers

  • I am very GPU-poor with a meager 6 GB of VRAM, so all logits were generated with already-quantized models (in this case, all IQ4_XS) so that I could actually run them. The silver lining is that since KLD measures relative entropy, these numbers still tell you how different the output logits would be with a quantized KV cache while using the same quantized model.
  • I'm not 100% sure you can get any meaningful information out of this. llama-perplexity computes KLD over the latter half of each context window it processes; if it were possible, I would have set it up with some real instruct conversations and measured KLD only on the assistant messages, with maybe a separate test targeting tool calls specifically. I did run one of the models through a text file made up of stitched RP segments totaling 200k tokens (wikitext-2 is 300k), but the results were pretty much identical to wikitext's, so I dropped it in favor of the more standardized option to save time and spare my SSD some suffering.
  • I couldn't get iq4_nl to run on CUDA for some reason, so it's not included.
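For reference, the per-token KLD that llama-perplexity reports compares the base-run distribution P (f16 KV) against the test-run distribution Q (quantized KV) at each position. A minimal sketch of that computation from raw logits, assuming NumPy arrays of shape (n_tokens, vocab_size) (the array names and helper are illustrative, not llama.cpp's actual code):

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocab dimension
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_kld(base_logits, test_logits, eps=1e-12):
    """Mean D_KL(P_base || Q_test) over token positions.

    base_logits: logits from the reference run (f16 KV cache)
    test_logits: logits from the run under test (quantized KV cache)
    """
    p = softmax(base_logits)
    q = softmax(test_logits)
    # sum_v p * log(p / q), then average over positions
    per_token = (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1)
    return per_token.mean()

# Sanity check: identical logits give (essentially) zero divergence
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))
print(mean_kld(logits, logits))
```

KLD is zero only when the two distributions match exactly, which is why it is a stricter test than comparing perplexities: two runs can have near-identical PPL while still disagreeing on individual tokens.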

Methodology

llama.cpp b8288 (b5fe4559a), built with GGML_CUDA_FA_ALL_QUANTS. Base logits were generated at f16 KV. For the "long" variant of wikitext, each model's context size was cranked up to the highest power of two that didn't crash llama-perplexity: 16k for Ministral and Irix, 8k for Qwen3.5 and Qwen3 VL, and 4k for Gemma 3. Otherwise, the default context size set by llama-perplexity is 512.
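The two-pass workflow described above can be sketched roughly as follows. This is an assumption about the exact invocation based on llama-perplexity's documented `--kl-divergence-base` / `--kl-divergence` flags and the common `-ctk`/`-ctv` cache-type options; file names and the q8_0 cache type are placeholders, and quantized V-cache types typically require flash attention (`-fa`) to be enabled:

```shell
# Pass 1: generate and save base logits with the default f16 KV cache
./llama-perplexity -m model-IQ4_XS.gguf -f wiki.test.raw \
    --kl-divergence-base base_logits.bin

# Pass 2: re-run with a quantized KV cache and report KLD vs. the base
./llama-perplexity -m model-IQ4_XS.gguf -f wiki.test.raw \
    --kl-divergence-base base_logits.bin --kl-divergence \
    -fa -ctk q8_0 -ctv q8_0
```

Because the same IQ4_XS weights are used in both passes, any divergence measured in pass 2 is attributable to the KV cache quantization alone.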

Results

Normal wikitext-2

Long wikitext-2

Before running wikitext I did a bunch of tests on a small (32k-token) conversation, at the same context sizes as long wikitext, to make sure everything worked correctly. Around that point I saw a thread claiming Bartowski's quants have better KLDs than Unsloth's for Qwen3.5 9B, so I tested both. For wikitext I only used Bartowski's quant. I wouldn't take any of these numbers too seriously given the low number of samples.

Test conversation

More results

All of the complete results given by llama-perplexity, including PPL and token statistics, have been uploaded to this repo in case you want to inspect them (don't ask me why ± and Δp got turned into Japanese characters; the terminal just did that).

Personal observations

  • The KLD impact of KV quantization generally seems to be a bit lower than that of "equivalent" weight quants, but I can't draw any real conclusions from that because it's unclear how the two compound. I'm considering running more tests with a model I can actually load in bf16 (like Qwen3.5 2B) to explore this.
  • Qwen3 VL very much doesn't like having its KV quantized.
submitted by /u/Velocita84