KLD measurements of 8 different llama.cpp KV cache quantizations over several 8-12B models

Reddit r/LocalLLaMA / 3/24/2026


Key Points

  • The post reports self-run KLD (relative entropy) measurements comparing llama.cpp KV cache quantization settings across eight 8B–12B models (e.g., Qwen3.5 9B, Qwen3-VL 8B, Gemma 3 12B, Ministral 3 8B, Irix 12B).
  • The author generates base logits using an f16 KV cache and then evaluates how quantizing the KV cache (tested via llama.cpp) changes the resulting output distributions using KLD.
  • Due to limited GPU VRAM, the author uses already-quantized models (IQ4_XS) to produce logits, arguing that KLD still meaningfully reflects divergence caused by KV quantization when the model weights are held fixed.
  • The methodology uses llama-perplexity-style KLD computation over parts of the context window (mostly validated on wikitext-2 and a stitched 200k-token RP test, with results largely matching wikitext-2).
  • Some quantization variants (e.g., iq4_nl) could not be run on CUDA, so they are excluded from the comparison.

A couple of weeks ago I was wondering about the impact of KV quantization, so I tried looking for any PPL or KLD measurements but didn't find anything extensive. I ran some of my own, and these are the results. Models included: Qwen3.5 9B, Qwen3 VL 8B, Gemma 3 12B, Ministral 3 8B, Irix 12B (Mistral Nemo).

Disclaimers

  • I am very GPU-poor with a meager 6 GB of VRAM, so all logits were generated with already-quantized models (in this case, all IQ4_XS) so that I could actually run them. The silver lining is that since KLD measures relative entropy, these numbers still tell you how different the output logits would be with a quantized KV cache while using the same quantized model.
  • I'm not 100% sure you can get any meaningful information out of this. llama-perplexity computes KLD over the latter half of each context window it processes; if it were possible, I would have set it up with some real instruct conversations and measured KLD only on the assistant messages, with maybe a separate test targeting tool calls specifically. I did run one of the models through a text file made up of stitched RP segments totaling 200k tokens (wikitext-2 is 300k), but the results were pretty much identical to wikitext's, so I dropped it in favor of the more standardized option to save time and spare my SSD some suffering.
  • I couldn't get iq4_nl to run on CUDA for some reason, so it's not included.
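For reference, the per-token KLD that llama-perplexity reports compares the base-run distribution P (f16 KV) against the test-run distribution Q (quantized KV) at each position. A minimal sketch of that computation from raw logits, assuming NumPy arrays of shape (n_tokens, vocab_size) (the array names and helper are illustrative, not llama.cpp's actual code):

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocab dimension
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_kld(base_logits, test_logits, eps=1e-12):
    """Mean D_KL(P_base || Q_test) over token positions.

    base_logits: logits from the reference run (f16 KV cache)
    test_logits: logits from the run under test (quantized KV cache)
    """
    p = softmax(base_logits)
    q = softmax(test_logits)
    # sum_v p * log(p / q), then average over positions
    per_token = (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1)
    return per_token.mean()

# Sanity check: identical logits give (essentially) zero divergence
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))
print(mean_kld(logits, logits))
```

KLD is zero only when the two distributions match exactly, which is why it is a stricter test than comparing perplexities: two runs can have near-identical PPL while still disagreeing on individual tokens.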

Methodology

llama.cpp b8288 (b5fe4559a), built with GGML_CUDA_FA_ALL_QUANTS. Base logits were generated at f16 KV. For the "long" variant of wikitext, each model's context size was cranked up to the highest power of two that didn't crash llama-perplexity: 16k for Ministral and Irix, 8k for Qwen3.5 and Qwen3 VL, and 4k for Gemma 3. Otherwise, the default context size set by llama-perplexity is 512.
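The two-pass workflow described above can be sketched roughly as follows. This is an assumption about the exact invocation based on llama-perplexity's documented `--kl-divergence-base` / `--kl-divergence` flags and the common `-ctk`/`-ctv` cache-type options; file names and the q8_0 cache type are placeholders, and quantized V-cache types typically require flash attention (`-fa`) to be enabled:

```shell
# Pass 1: generate and save base logits with the default f16 KV cache
./llama-perplexity -m model-IQ4_XS.gguf -f wiki.test.raw \
    --kl-divergence-base base_logits.bin

# Pass 2: re-run with a quantized KV cache and report KLD vs. the base
./llama-perplexity -m model-IQ4_XS.gguf -f wiki.test.raw \
    --kl-divergence-base base_logits.bin --kl-divergence \
    -fa -ctk q8_0 -ctv q8_0
```

Because the same IQ4_XS weights are used in both passes, any divergence measured in pass 2 is attributable to the KV cache quantization alone.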

Results

Normal wikitext-2

Long wikitext-2

Before running wikitext I did a bunch of tests on a small (32k-token) conversation, at the same context sizes as long wikitext, to make sure everything worked correctly. Around that point I saw a thread claiming Bartowski's quants have better KLDs than Unsloth's for Qwen3.5 9B, so I tested both. For wikitext I only used Bartowski's quant. I wouldn't take any of these numbers too seriously given the low number of samples.

Test conversation

More results

All of the complete results given by llama-perplexity, including PPL and token statistics, have been uploaded to this repo in case you want to inspect them (don't ask me why ± and Δp got turned into Japanese characters; the terminal just did that).

Personal observations

  • The KLD impact of KV quantization generally seems to be a bit lower than that of "equivalent" weight quants, but I can't draw any real conclusions from that because it's unclear how the two compound. I'm considering running more tests with a model I can actually load in bf16 (like Qwen3.5 2B) to explore this.
  • Qwen3 VL very much doesn't like having its KV quantized.
submitted by /u/Velocita84