A couple of weeks ago I was wondering about the impact of KV quantization, so I tried looking for any PPL or KLD measurements but didn't find anything extensive. I ran some of my own, and these are the results. Models included: Qwen3.5 9B, Qwen3 VL 8B, Gemma 3 12B, Ministral 3 8B, Irix 12B (Mistral Nemo).

Disclaimers

Methodology
Llama.cpp b8288 (b5fe4559a), built with …

Results
Before running wikitext I did a bunch of tests on a small (32k-token) conversation to make sure that everything worked correctly, with the same context sizes as long wikitext. At that point I saw a thread saying Bartowski's quants had better KLDs than Unsloth's for Qwen3.5 9B, so I tested both. For wikitext I only used Bartowski's quant. I wouldn't take any of these numbers too seriously, considering the low number of samples.

More results
All of the complete results produced by llama-perplexity, including PPL and token statistics, have been uploaded to this repo in case you want to inspect them (don't ask me why ± and Δp got turned into Japanese characters; the terminal just did that).

Personal observations
KLD measurements of 8 different llama.cpp KV cache quantizations over several 8-12B models
Reddit r/LocalLLaMA / 3/24/2026
Key Points
- The post reports self-run KLD (Kullback–Leibler divergence) measurements comparing eight llama.cpp KV cache quantization settings across five 8B–12B models (Qwen3.5 9B, Qwen3-VL 8B, Gemma 3 12B, Ministral 3 8B, Irix 12B).
- The author generates base logits using an f16 KV cache, then measures how quantizing the KV cache changes the model's output distributions, using KLD as computed by llama.cpp.
- Due to limited GPU VRAM, the author uses already-quantized models (IQ4_XS) to produce logits, arguing that KLD still meaningfully reflects divergence caused by KV quantization when the model weights are held fixed.
- The methodology uses llama-perplexity-style KLD computation over parts of the context window, run mostly on wikitext-2 plus a stitched 200k-token RP test whose results largely matched the wikitext-2 ones.
- Some quantization variants (e.g., iq4_nl) could not be run on CUDA, so they are excluded from the comparison.
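The evaluation loop described above (baseline logits from an f16 KV cache, then per-token KL divergence against a run with a quantized KV cache) can be sketched in NumPy. This is an illustrative sketch, not the author's actual pipeline: the array names and shapes are assumptions, and in practice the logits come from llama-perplexity's logit files rather than from random data.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocab dimension.
    m = logits.max(axis=-1, keepdims=True)
    shifted = logits - m
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_kld(base_logits, test_logits):
    """Mean per-token KL(P_base || P_test) over a batch of positions.

    base_logits, test_logits: arrays of shape (n_tokens, vocab_size),
    e.g. baseline logits from an f16 KV cache vs. logits from a
    quantized-KV run of the same model on the same text.
    """
    logp = log_softmax(base_logits)
    logq = log_softmax(test_logits)
    p = np.exp(logp)
    # KL(P||Q) = sum_v p_v * (log p_v - log q_v), averaged over tokens.
    kld_per_token = (p * (logp - logq)).sum(axis=-1)
    return kld_per_token.mean()

# Sanity checks with toy data: identical logits give a KLD of 0,
# while perturbed logits give a small positive KLD.
rng = np.random.default_rng(0)
base = rng.normal(size=(4, 32))
noisy = base + rng.normal(scale=0.1, size=base.shape)
assert abs(mean_kld(base, base)) < 1e-12
assert mean_kld(base, noisy) > 0
```

A mean KLD near zero means the quantized KV cache barely shifts the model's next-token distributions; larger values indicate the quantization is measurably changing what the model would generate, which is why the post uses KLD rather than perplexity alone.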