TurboQuant seems to work very well on Gemma 4 — and separately, per-layer outlier-aware K quantization is beating current public fork results on Qwen PPL

Reddit r/LocalLLaMA / 4/5/2026

💬 Opinion

Key Points

  • The author reports strong TurboQuant KV-cache quantization results in llama.cpp (CPU + Metal) on an Apple M4 Pro for Gemma 4 26B A4B-it, noting near-zero accuracy loss at ~3.1 bits per K channel and a meaningful long-context speedup.

I’ve been experimenting with TurboQuant KV cache quantization in llama.cpp (CPU + Metal) on Gemma 4 26B A4B-it Q4_K_M on an Apple M4 Pro 48GB, and the results look surprisingly strong.

Gemma 4 findings

On Gemma 4, QJL seems to work well, and FWHT as a structured rotation substitute also looks like a good fit for the large attention heads (dk=256/512).

My benchmark results:

  • tq3j/q4_0: 37/37 on quality tests, 8/8 on NIAH
  • tq2j/q4_0: 36/37, with the only miss being an empty response
  • +34% faster than q4_0/q4_0 at 131K context
  • TurboQuant overtakes q4_0 from 4K context onward

So on this setup, ~3.1 bits per K channel gets near-zero accuracy loss with a meaningful long-context speedup.

What’s also interesting is that this looks better than the public Gemma 4 fork results I’ve seen so far. In the linked 512-d Gemma 4 experiments, 512-WHT + global norm reaches 31/65, while the TBQP3 512 + QJL variants land around 23–28/65. That’s a very different outcome from what I’m seeing with the Metal implementation above.

Also worth noting: I’m not using Gemma 4 PPL right now, because PPL seems unreliable / broken there in llama.cpp at the moment, so for Gemma 4 I’m judging mostly from direct quality evals, NIAH, and long-context speed.

Separate result: Qwen PPL

Separately from the Gemma 4 work, I also have a per-layer / per-channel outlier-aware adaptive K quantization setup for Qwen2.5 / Qwen3.

Those results seem to beat current public fork-style implementations on PPL at comparable bpv:

  • Qwen2.5 1.5B: 11.514 vs q8_0 11.524 at 6.21 bpv
  • Qwen2.5 7B: 8.927 vs q8_0 8.949 at 6.41 bpv
  • Qwen3 8B: 10.848, within CI of both f16 and q8_0, at 5.125 bpv

That makes me think a lot of the gap is in per-layer allocation / calibration / outlier handling, not just in the base quantizer.

I also did some per-layer variance analysis on Gemma 4, and the spread differs a lot across layers, so there’s probably still room to improve further with mixed per-layer K types instead of one fixed recipe everywhere.
Gemma 4 benchmarks / details:

https://github.com/andrei-ace/llama.cpp/tree/turboquant-gemma/benches/tq-metal

Qwen per-layer / outlier-aware PPL results:

https://github.com/ggml-org/llama.cpp/discussions/21297

Gemma 4 comparison point in the TurboQuant thread:

https://github.com/ggml-org/llama.cpp/discussions/20969#discussioncomment-16450839

submitted by /u/Fearless-Wear8100
[link] [comments]