I’ve been experimenting with TurboQuant KV cache quantization in llama.cpp (CPU + Metal) on Gemma 4 26B A4B-it Q4_K_M on an Apple M4 Pro 48GB, and the results look surprisingly strong.
Gemma 4 findings
On Gemma 4, QJL seems to work well, and FWHT as a structured rotation substitute also looks like a good fit for the large attention heads (dk=256/512).
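For anyone unfamiliar with the FWHT idea, here is a minimal pure-Python sketch of a randomized fast Walsh-Hadamard transform used as a cheap orthogonal rotation. It is not the code from the fork. The function names, the random sign vector, and the toy input are assumptions made for illustration. The point is only that the rotation preserves the vector norm while spreading a single outlier channel across all channels, which is what makes the rotated values easier to quantize.

```python
import math
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(x)
    h = 1
    while h < n:
        # Butterfly pass: combine pairs that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform orthonormal
    return [v * scale for v in out]


def randomized_fwht(x, signs):
    """Flip signs with a fixed +/-1 vector, then apply the FWHT (hypothetical helper)."""
    return fwht([v * s for v, s in zip(x, signs)])


random.seed(0)
d = 256  # head dimension mentioned in the post
signs = [random.choice((-1.0, 1.0)) for _ in range(d)]

# Toy K vector with one large outlier channel (made-up data).
k = [0.01] * d
k[7] = 10.0

rotated = randomized_fwht(k, signs)
print("max |k| before:", max(abs(v) for v in k))
print("max |k| after: ", max(abs(v) for v in rotated))
```

The transform costs O(d log d) additions per vector instead of the O(d^2) multiply-adds of a dense random rotation, which is the usual argument for using it on large heads.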
My benchmark results:
- tq3j/q4_0: 37/37 on quality tests, 8/8 on NIAH
- tq2j/q4_0: 36/37, with the only miss being an empty response
- 34% faster than q4_0/q4_0 at 131K context
- TurboQuant overtakes q4_0 in speed from 4K context onward
So on this setup, quantizing K at ~3.1 bits per channel shows near-zero accuracy loss on these tests, along with a meaningful long-context speedup.
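To put the ~3.1-bit figure in context, here is a rough back-of-the-envelope sketch comparing per-value storage against ggml's q4_0 and q8_0 block formats (q4_0 stores 32 values in 18 bytes, q8_0 stores 32 values in 34 bytes). The 3.1 value is the approximate number quoted above, and the helper names are made up for this example. It only covers the storage of a single K vector per head, not a full model's KV cache.

```python
def bits_per_value(block_bytes, block_values):
    """Effective bits per stored value for a block-quantized format."""
    return block_bytes * 8 / block_values


Q4_0_BPV = bits_per_value(18, 32)  # fp16 scale + 32 4-bit values
Q8_0_BPV = bits_per_value(34, 32)  # fp16 scale + 32 8-bit values
TQ_K_BPV = 3.1                     # approximate figure quoted in the post


def k_vector_bytes(head_dim, bpv):
    """Bytes needed to store one K vector of head_dim channels (hypothetical helper)."""
    return head_dim * bpv / 8


for head_dim in (256, 512):  # head sizes mentioned in the post
    tq = k_vector_bytes(head_dim, TQ_K_BPV)
    q4 = k_vector_bytes(head_dim, Q4_0_BPV)
    print(f"d_k={head_dim}: ~{tq:.1f} B at 3.1 bpv vs {q4:.1f} B at q4_0 "
          f"(ratio {tq / q4:.2f})")
```

Less K data to read per token is one plausible contributor to the long-context speedup, though the measured numbers also depend on kernel details that this arithmetic does not capture.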
What’s also interesting is that this looks better than the public Gemma 4 fork results I’ve seen so far. In the linked 512-d Gemma 4 experiments, 512-WHT + global norm reaches 31/65, while the TBQP3 512 + QJL variants land around 23–28/65. That’s a very different outcome from what I’m seeing with the Metal implementation above.
Also worth noting: I’m not reporting perplexity (PPL) for Gemma 4 right now, because PPL seems unreliable or broken for this model in llama.cpp at the moment. For Gemma 4 I’m judging mostly from direct quality evals, NIAH, and long-context speed.
Separate result: Qwen PPL
Separately from the Gemma 4 work, I also have a per-layer / per-channel outlier-aware adaptive K quantization setup for Qwen2.5 / Qwen3.
Those results seem to beat current public fork-style implementations on PPL at comparable bits per value (bpv):
- Qwen2.5 1.5B: 11.514 vs q8_0 11.524 at 6.21 bpv
- Qwen2.5 7B: 8.927 vs q8_0 8.949 at 6.41 bpv
- Qwen3 8B: 10.848, within CI of both f16 and q8_0, at 5.125 bpv
That makes me think a lot of the gap is in per-layer allocation / calibration / outlier handling, not just in the base quantizer.
I also did some per-layer variance analysis on Gemma 4, and the spread differs a lot across layers, so there’s probably still room to improve further with mixed per-layer K types instead of one fixed recipe everywhere.
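As a rough illustration of what mixed per-layer K types could look like, here is a toy greedy allocator. It is not the method used for the results above. The rule "give the higher-bit type to the layers with the largest variance", the bit widths, and the variance numbers are all assumptions chosen to keep the example small.

```python
def assign_k_bits(layer_variance, low_bits, high_bits, avg_bits_budget):
    """Pick low_bits or high_bits per layer, upgrading high-variance layers first.

    Hypothetical heuristic: the average bits per layer must stay within
    avg_bits_budget.
    """
    n = len(layer_variance)
    bits = [low_bits] * n
    total_budget = avg_bits_budget * n
    used = low_bits * n
    # Visit layers from highest to lowest variance.
    order = sorted(range(n), key=lambda i: layer_variance[i], reverse=True)
    for i in order:
        extra = high_bits - low_bits
        if used + extra > total_budget + 1e-9:
            break  # budget exhausted; remaining layers stay at low_bits
        bits[i] = high_bits
        used += extra
    return bits


# Made-up per-layer K variances for a 4-layer toy model.
variances = [0.2, 3.0, 0.5, 1.5]
plan = assign_k_bits(variances, low_bits=2, high_bits=4, avg_bits_budget=3.0)
print(plan)
```

A real allocator would more likely be driven by measured quantization error or PPL impact per layer rather than raw variance, but the budget-constrained structure would be similar.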
Gemma 4 benchmarks / details:
https://github.com/andrei-ace/llama.cpp/tree/turboquant-gemma/benches/tq-metal
Qwen per-layer / outlier-aware PPL results:
https://github.com/ggml-org/llama.cpp/discussions/21297
Gemma 4 comparison point in the TurboQuant thread:
https://github.com/ggml-org/llama.cpp/discussions/20969#discussioncomment-16450839