Took TheTom's TurboQuant Metal fork of llama.cpp (github.com/TheTom/llama-cpp-turboquant, the feature/turboquant-kv-cache branch) and ran a depth sweep on Qwen 3.6-35B-A3B Q8. TheTom had already published M5 Max numbers up to 32K; I wanted to see what the curves look like once you push past that.
Hardware: MacBook Pro M5 Max, 128 GB unified memory. Built the fork with cmake -B build -DGGML_METAL=ON, then ran llama-bench with 3 reps per cell, flash-attn on, mlock on. About 8 hours wall-clock, run overnight.
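For reference, the full build went roughly like this (branch name from above; the Release config and job count are just my defaults):

```bash
# clone the fork's branch and build with the Metal backend enabled
git clone -b feature/turboquant-kv-cache https://github.com/TheTom/llama-cpp-turboquant
cd llama-cpp-turboquant
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j
```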
Cache types: f16, q8_0, turbo3, turbo4. Symmetric K and V (-ctk and -ctv set to the same type). Depths from 0 to 1M tokens.
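The sweep was one llama-bench invocation per cache type, since passing comma-separated lists to both -ctk and -ctv would cross-product them instead of keeping K and V symmetric. Roughly this, assuming the fork keeps upstream llama-bench flags including the newer -d depth option (model path illustrative, -p/-n are just the defaults; check llama-bench --help on your build, flag spellings have shifted over time):

```bash
# one run per type keeps -ctk/-ctv symmetric; -d benchmarks at each cache depth
for t in f16 q8_0 turbo3 turbo4; do
  ./build/bin/llama-bench \
    -m models/qwen3.6-35b-a3b-q8_0.gguf \
    -fa 1 -r 3 -p 512 -n 128 \
    -d 0,8192,32768,131072,262144,524288,1048576 \
    -ctk "$t" -ctv "$t" -o md
done
```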
Generation throughput (tok/s):
| Depth | f16 | q8_0 | turbo3 | turbo4 |
|---|---|---|---|---|
| 0 | 89.4 | 87.4 | 79.5 | 79.7 |
| 8K | 84.2 | 79.2 | 72.2 | 71.2 |
| 32K | 72.6 | 67.8 | 61.5 | 61.8 |
| 128K | 44.4 | 40.7 | 36.0 | 37.7 |
| 256K | OOM | 26.6 | 22.9 | 25.5 |
| 512K | OOM | OOM | 13.3 | 16.0 |
| 1M | OOM | OOM | 6.5 | OOM |
Prompt processing throughput (tok/s):
| Depth | f16 | q8_0 | turbo3 | turbo4 |
|---|---|---|---|---|
| 0 | 2962 | 2948 | 2904 | 2854 |
| 8K | 2098 | 1623 | 1653 | 1439 |
| 32K | 1063 | 802 | 784 | 678 |
| 128K | 321 | 245 | 253 | 206 |
| 256K | OOM | 124 | 128 | 101 |
| 512K | OOM | OOM | 66 | 56 |
| 1M | OOM | OOM | 30 | OOM |
What stood out
At depth 0 the standard story holds. f16 wins by a hair on prefill, turbo3 is about 10% slower on decode. Most write-ups stop here.
At 128K the 3-bit cache catches up to and slightly overtakes the 8-bit cache on prefill (turbo3 253 vs q8_0 245 t/s). Smaller cache means less bandwidth pressure during attention, and that bandwidth-bound regime favors turbo3 once contexts grow past about 100K on this hardware.
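To put rough numbers on the bandwidth point, here's a back-of-envelope ratio of KV bytes read per attention pass at a fixed depth. The f16 and q8_0 figures follow ggml's layouts (q8_0 stores 32 int8 quants plus one fp16 scale per block); the turbo3/turbo4 bits-per-element are my guesses, since the fork's block format isn't spelled out here:

```python
# bytes per element for each KV cache type; q8_0 = 34 bytes per 32 elems.
# turbo3/turbo4 figures are ASSUMED (~3.5 and ~4.5 bits/elem incl. scales),
# not taken from the fork's docs.
bytes_per_elem = {"f16": 2.0, "q8_0": 34 / 32, "turbo3": 3.5 / 8, "turbo4": 4.5 / 8}

for t, b in bytes_per_elem.items():
    # relative KV traffic at a given depth, normalized to f16
    print(f"{t:>7}: {b / bytes_per_elem['f16']:.2f}x f16 cache traffic")
```

Under those guesses turbo3 moves roughly 0.2x the bytes f16 does and about 0.4x what q8_0 does, which is enough headroom to pay for the extra dequant work once attention reads dominate, consistent with the prefill crossover in the table.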
The bigger surprise was turbo3 vs turbo4. They split by phase. At 256K turbo3 wins prefill +27% over turbo4 (128 vs 101 t/s), but turbo4 wins decode +11% over turbo3 (25.5 vs 22.9 t/s). At 512K the decode gap widens to +20% (turbo4 16.0 vs turbo3 13.3). Different bottleneck regimes during prefill and decode mean the right cache type depends on the workload.
What I take from that (sample launch command after the list):
- Coding agents (deep context, lots of generated tokens per turn): turbo4
- RAG or batch QA (heavy prefill, short answers): turbo3
- Pure context window maxing (1M): turbo3, only one that fits
- Short interactive (under 32K): f16 if it fits, else q8_0
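As a concrete example of the first row, here's how I'd launch a long-context coding-agent server, assuming the fork keeps upstream llama-server flags and registers the turbo types (model path illustrative; add your build's flash-attn flag as appropriate):

```bash
# coding-agent profile: deep context, decode-heavy -> turbo4 for both K and V
./build/bin/llama-server \
  -m models/qwen3.6-35b-a3b-q8_0.gguf \
  -c 262144 -ngl 99 --mlock \
  --cache-type-k turbo4 --cache-type-v turbo4
```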
The 1M cell on turbo3 was 6.5 tok/s decode. Not chat speed, but workable for overnight agentic batch jobs. Memory at 1M came to about 89 GB (37 GB for the weights, ~52 GB for the KV cache), which fits in 128 GB with room for the OS reserve.
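A capacity sketch that roughly reproduces the OOM cells, anchored to that 52 GB number (~52 GB per 1M tokens for turbo3) and scaled by the same assumed bits-per-element as above. The 75% figure is Metal's default GPU working-set cap on unified memory; treat the whole thing as back-of-envelope:

```python
# KV GB per 1M tokens, anchored: turbo3 ~= 52 GB at 1M (from the run above);
# other types scaled by assumed bytes/elem (turbo3/turbo4 ratios are guesses)
bytes_per_elem = {"f16": 2.0, "q8_0": 34 / 32, "turbo3": 3.5 / 8, "turbo4": 4.5 / 8}
gb_per_mtok = {t: 52.0 * b / bytes_per_elem["turbo3"] for t, b in bytes_per_elem.items()}

WEIGHTS_GB = 37.0
BUDGET_GB = 0.75 * 128 - WEIGHTS_GB  # Metal caps the GPU working set near 75% of RAM

for depth_k in (128, 256, 512, 1024):
    cells = []
    for t, gb in gb_per_mtok.items():
        kv = gb * depth_k / 1024
        cells.append(f"{t} {kv:.0f}GB{' OOM' if kv > BUDGET_GB else ''}")
    print(f"{depth_k:>4}K: " + " | ".join(cells))
```

With those guesses the predicted OOM boundaries land exactly on the table above, which at least suggests the assumed bits-per-element are in the right ballpark.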
Caveats
This is one M5 Max. The crossover point and the prefill/decode split likely shift with memory bandwidth and GPU core count. I tested symmetric K and V combinations only; I saw a thread suggesting asymmetric types (-ctk q8_0 -ctv turbo4) as a default, which I haven't benched yet. TheTom's fork is research-grade and not yet upstream in llama.cpp main, so rebases will be needed as upstream moves.
If you have non-M5-Max Apple Silicon (M2 Pro/Max, M3 Ultra, M4 Max) and want to run the same sweep, drop your numbers below or DM me. The curves likely shift with hardware and a second data point would help characterize the crossover.
Full grid and methodology in a writeup if you want the longer version: https://llmkube.com/blog/turboquant-m5-max-long-context