Qwen 3.6-35B-A3B KV cache bench: f16 vs q8_0 vs turbo3 vs turbo4 from 0 to 1M context on M5 Max

Reddit r/LocalLLaMA / 4/29/2026


Key Points

  • The author benchmarked Qwen 3.6-35B-A3B KV cache behavior across context lengths from 0 to 1M tokens on an M5 Max using a llama.cpp turboquant KV-cache fork, comparing f16, q8_0, turbo3, and turbo4.
  • For generation throughput, f16 leads slightly at very short contexts but runs out of memory first, while q8_0 and the turbo variants keep running at greater depths.
  • As context grows (past roughly 100K tokens on this hardware), the workload becomes bandwidth-bound and the smaller KV caches (notably turbo3) can beat the 8-bit cache on prompt prefill.
  • A key finding is that turbo3 and turbo4 trade off by phase: turbo3 is faster for prefill at 256K while turbo4 is faster for decode, and the decode advantage widens at larger depths (e.g., 512K).

Took TheTom's TurboQuant Metal fork of llama.cpp (github.com/TheTom/llama-cpp-turboquant, the feature/turboquant-kv-cache branch) and ran a depth sweep on Qwen 3.6-35B-A3B Q8. TheTom had already published M5 Max numbers up to 32K; I wanted to see what the curves look like once you push well past that.

Hardware: MacBook Pro M5 Max, 128 GB unified memory. Built the fork with cmake -B build -DGGML_METAL=ON. llama-bench, 3 reps per cell, flash-attn on, mlock on, 8 hours wall-clock overnight.

Cache types: f16, q8_0, turbo3, turbo4. Symmetric K and V (-ctk and -ctv set to the same type). Depths from 0 to 1M tokens.
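If you want to rerun the same sweep, this is roughly what the commands look like. Treat it as a sketch rather than my exact script: the repo and branch are the ones above, but the model path is a placeholder and the precise llama-bench flags (especially for measuring decode at a given depth) can differ between the fork and your llama.cpp version, so check --help on the build you end up with.

```
# Build the fork with Metal support (same cmake invocation as above)
git clone -b feature/turboquant-kv-cache https://github.com/TheTom/llama-cpp-turboquant
cd llama-cpp-turboquant
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j

# Sweep the four KV-cache types, symmetric K and V, 3 reps per cell, flash-attn on.
# MODEL is a placeholder path -- point it at your Qwen 3.6-35B-A3B Q8 gguf.
MODEL=./models/Qwen3.6-35B-A3B-Q8_0.gguf
for kv in f16 q8_0 turbo3 turbo4; do
  ./build/bin/llama-bench -m "$MODEL" \
    -ctk "$kv" -ctv "$kv" -fa 1 -r 3 \
    -p 8192,32768,131072,262144,524288,1048576 -n 128
done
```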

Generation throughput (tok/s):

| Depth | f16  | q8_0 | turbo3 | turbo4 |
|-------|------|------|--------|--------|
| 0     | 89.4 | 87.4 | 79.5   | 79.7   |
| 8K    | 84.2 | 79.2 | 72.2   | 71.2   |
| 32K   | 72.6 | 67.8 | 61.5   | 61.8   |
| 128K  | 44.4 | 40.7 | 36.0   | 37.7   |
| 256K  | OOM  | 26.6 | 22.9   | 25.5   |
| 512K  | OOM  | OOM  | 13.3   | 16.0   |
| 1M    | OOM  | OOM  | 6.5    | OOM    |

Prompt processing throughput (tok/s):

| Depth | f16  | q8_0 | turbo3 | turbo4 |
|-------|------|------|--------|--------|
| 0     | 2962 | 2948 | 2904   | 2854   |
| 8K    | 2098 | 1623 | 1653   | 1439   |
| 32K   | 1063 | 802  | 784    | 678    |
| 128K  | 321  | 245  | 253    | 206    |
| 256K  | OOM  | 124  | 128    | 101    |
| 512K  | OOM  | OOM  | 66     | 56     |
| 1M    | OOM  | OOM  | 30     | OOM    |

What stood out

At depth 0 the standard story holds. f16 wins by a hair on prefill, turbo3 is about 10% slower on decode. Most write-ups stop here.

At 128K the 3-bit cache pulls even with the 8-bit cache on prefill and even edges past it (turbo3 253 vs q8_0 245 tok/s). A smaller cache means less bandwidth pressure during attention, and that bandwidth-bound regime favors turbo3 once contexts grow past about 100K on this hardware.

The bigger surprise was turbo3 vs turbo4. They split by phase. At 256K turbo3 wins prefill +27% over turbo4 (128 vs 101 t/s), but turbo4 wins decode +11% over turbo3 (25.5 vs 22.9 t/s). At 512K the decode gap widens to +20% (turbo4 16.0 vs turbo3 13.3). Different bottleneck regimes during prefill and decode mean the right cache type depends on the workload.

What I take from that:

  • Coding agents (deep context, lots of generated tokens per turn): turbo4 (example launch command after this list)
  • RAG or batch QA (heavy prefill, short answers): turbo3
  • Pure context-window maxing (1M): turbo3, the only one that fits
  • Short interactive sessions (under 32K): f16 if it fits, else q8_0
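To make those presets concrete, the coding-agent case would look something like the command below; the others are the same launch with different -ctk/-ctv values. The paths and port are placeholders, turbo4 only exists in the fork's build, and flag spellings (e.g. for flash attention) drift between llama.cpp versions, so adjust to your build.

```
# Coding-agent preset: turbo4 KV cache, deep context, weights locked in memory.
./build/bin/llama-server -m ./models/Qwen3.6-35B-A3B-Q8_0.gguf \
  -c 262144 -ctk turbo4 -ctv turbo4 --mlock --port 8080
```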

The 1M cell on turbo3 was 6.5 tok/s decode. Not chat speed, but workable for overnight agentic batch jobs. Memory at 1M came to about 89 GB (37 GB for the weights, ~52 GB for the KV cache), which fits in 128 GB with room for the OS reserve.
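For anyone checking the memory math, my back-of-envelope is just 2 (K and V) × layers × KV heads × head dim × context × bytes per element. The config numbers in the example below are placeholders, not the real Qwen 3.6-35B-A3B values, and quantized caches carry per-block scale overhead on top of the nominal bit width, so treat the output as ballpark only.

```
# Rough KV-cache size: 2 * layers * kv_heads * head_dim * ctx * bits/8
kv_gb() {  # usage: kv_gb <layers> <kv_heads> <head_dim> <ctx> <effective_bits>
  awk -v l="$1" -v h="$2" -v d="$3" -v c="$4" -v b="$5" \
    'BEGIN { printf "%.1f GB\n", 2*l*h*d*c*(b/8)/1e9 }'
}
kv_gb 48 8 128 1048576 4   # made-up config at ~4 effective bits -> ~51.5 GB, same ballpark as above
```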

Caveats

This is one M5 Max. The crossover point and the prefill/decode split likely shift with memory bandwidth and GPU core count. I only tested symmetric K and V combinations. I saw a thread suggesting asymmetric types (-ctk q8_0 -ctv turbo4) as a default, which I haven't benched yet. TheTom's fork is research-grade and not yet upstream in llama.cpp main, so rebases will be needed as upstream moves.
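If someone wants to try that asymmetric combo before I do, it should just be the same llama-bench call with split types (again assuming the fork keeps the upstream flag names):

```
# Asymmetric cache: 8-bit keys, turbo4 values -- not in the tables above.
./build/bin/llama-bench -m "$MODEL" -ctk q8_0 -ctv turbo4 -fa 1 -r 3 \
  -p 32768,131072,262144 -n 128
```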

If you have non-M5-Max Apple Silicon (M2 Pro/Max, M3 Ultra, M4 Max) and want to run the same sweep, drop your numbers below or DM me. The curves likely shift with hardware and a second data point would help characterize the crossover.

Full grid and methodology in a writeup if you want the longer version: https://llmkube.com/blog/turboquant-m5-max-long-context

submitted by /u/Defilan