Qwen 3.6-35B-A3B KV cache part 2: PPL, KL divergence, asymmetric K/V, 64K row on M5 Max

Reddit r/LocalLLaMA / 4/30/2026


Key Points

  • The post reports overnight follow-up KV-cache experiments for Qwen 3.6-35B-A3B on an M5 Max, measuring perplexity (PPL), KL divergence vs f16, and top-1 token agreement on wikitext-2 at a 4K context length.
  • Using q8_0 KV shows essentially no quality regression at 4K (PPL change within stderr, very small KL, and 98.6% top-1 agreement), while turbo4 and turbo3 introduce progressively larger quality/consistency losses.
  • Turbo3 causes about a 1% PPL increase and ~5 percentage points lower top-token agreement compared with q8_0, and turbo4 lands in between, consistent with their compression levels.
  • A depth sweep of asymmetric K/V combos (q8_0 K with turbo4 V, q8_0 K with turbo3 V, f16 K with turbo4 V) shows -ctk q8_0 -ctv turbo4 as the standout: it matches symmetric q8_0 throughput at 256K and fits a 512K context where symmetric q8_0 OOMs.
  • The author notes throughput gains for asymmetric schemes but calls for additional PPL runs on the asymmetric K/V setups to fully validate quality tradeoffs beyond KL/logit comparisons.

Follow-up to yesterday's post: https://www.reddit.com/r/LocalLLaMA/comments/1sy7srk/. Comments asked for perplexity, KL divergence, asymmetric K/V combos, and a 64K data point. Ran them overnight. Same M5 Max, same Qwen 3.6-35B-A3B Q8, same TheTom TurboQuant fork (feature/turboquant-kv-cache).

Quality (perplexity + KL divergence on wikitext-2)

For u/milpster and u/Karyo_Ten. Context size 4096, since the canonical 512 doesn't fill enough KV cache to surface cache-quantization effects. f16 saves the baseline logits via --kl-divergence-base, then each quant run computes KL against that.
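
For anyone reproducing this, the flow looks roughly like the sketch below. It assumes the fork keeps stock llama-perplexity behavior; the model and dataset paths are placeholders, and the turbo* cache types only exist on the fork.

```
# Save baseline logits from the f16-cache run, then score each quantized
# cache against them. -fa is required for a quantized V cache in llama.cpp.
MODEL=qwen3.6-35b-a3b-q8_0.gguf         # placeholder path
WIKI=wikitext-2-raw/wiki.test.raw       # placeholder path

./llama-perplexity -m "$MODEL" -f "$WIKI" -c 4096 -fa \
  -ctk f16 -ctv f16 --kl-divergence-base base-logits.bin

for kv in q8_0 turbo3 turbo4; do        # turbo* exist only on the fork
  ./llama-perplexity -m "$MODEL" -f "$WIKI" -c 4096 -fa \
    -ctk "$kv" -ctv "$kv" \
    --kl-divergence-base base-logits.bin --kl-divergence
done
```

The --kl-divergence run reports the same-top-token statistic alongside KL, which is where the agreement column comes from.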

Cache             PPL      KL vs f16   Top-1 token agreement
f16               5.7438   baseline    n/a
q8_0              5.7433   0.0016      98.64%
turbo3 (~4.9x)    5.8092   0.0199      93.93%
turbo4 (~3.8x)    5.7810   0.0131      95.28%

q8_0 KV is essentially free at this depth. The PPL delta is -0.0005, well inside the ±0.036 stderr, and KL is 0.0016. The quantized cache picks the same top-1 token as f16 98.6% of the time. The worry in yesterday's comments was "what does this cost in quality." At 4K context, it's noise.

turbo3 costs about a 1% PPL increase and 5 percentage points of top-token agreement, with KL roughly 12x q8_0's. turbo4 sits in between, in line with its lower compression ratio. Quality cost scales with compression; no surprises.

Asymmetric K/V (depth sweep)

For u/Sabin_Stargem and my own untested caveat from yesterday. Decode tok/s, same llama-bench flags as the symmetric sweep (invocation sketch after the table):

Depth   q8_0 K / turbo4 V   q8_0 K / turbo3 V   f16 K / turbo4 V
0       82.9                81.8                72.8
8K      75.4                75.6                16.9
32K     66.0                63.2                8.6
128K    41.0                38.2                2.8
256K    27.1                25.0                skipped
512K    16.5                14.8                skipped
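
The invocation shape, for reference. This is a sketch assuming stock llama-bench flags (the exact flags are in yesterday's post), with the model path as a placeholder:

```
# One asymmetric column of the sweep. -d sets the KV depth to fill before
# measuring; -ctk/-ctv set the K and V cache types independently.
./llama-bench -m qwen3.6-35b-a3b-q8_0.gguf -fa 1 \
  -ctk q8_0 -ctv turbo4 -p 512 -n 128 \
  -d 0,8192,32768,131072,262144,524288
```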

-ctk q8_0 -ctv turbo4 is the standout. At 256K it matches yesterday's symmetric q8_0 throughput (pp 128 vs 124, tg 27.1 vs 26.6), and it fits 512K where symmetric q8_0 OOM'd. So you get q8_0-grade prefill behavior with a turbo4-grade context ceiling. Sabin's hypothesis that V compresses cheap and K compresses expensive looks right on the throughput side. On the quality side, I'd want a PPL run on the asym combos to fully close the loop.

-ctk q8_0 -ctv turbo3 does the same trick but with worse decode. Tighter V quant taxes the generation side more.

-ctk f16 -ctv turbo4 is effectively broken on this fork on Metal. The FlashAttention kernel doesn't fast-path that K/V type combination, so it falls back to a generic dequantize-then-attend path. At 8K it's 34x slower than symmetric f16; at 128K it's 78x slower (4.1 t/s pp). Cells past 128K weren't worth completing. Don't use this combo.

64K row

For u/ocarina24. Filling the 32K to 128K gap on the prefill curve. All seven configs at depth 65536:

Cache                pp512 (t/s)   tg128 (t/s)
f16 (symmetric)      602.0         59.8
q8_0 (symmetric)     479.2         57.9
turbo3 (symmetric)   469.8         49.9
turbo4 (symmetric)   418.0         55.2
q8_0 K / turbo4 V    468.2         55.9
q8_0 K / turbo3 V    465.6         52.6
f16 K / turbo4 V     8.3           4.9

Two things stood out. First, the prefill curves are nearly converged at 64K. turbo3 (470) is within 2% of q8_0 (479). Yesterday's data showed turbo3 actually pulling ahead by 128K (253 vs 245), so the bandwidth-bound regime kicks in somewhere between 64K and 128K on this hardware. Earlier than I'd estimated. Second, the asymmetric q8_0/turbo* rows track symmetric q8_0 prefill closely at this depth too. Same story as the deeper rows.

What I take from all of this

Updated cache-type recommendations from yesterday (a launch sketch for the top pick follows the list):

  • Coding agents (deep context, lots of generated tokens): -ctk q8_0 -ctv turbo4 is the new pick. q8_0 quality on K, turbo4 savings on V, fits 512K.
  • RAG or batch QA (heavy prefill, short answers): same combo, or symmetric turbo3 at the deepest depths.
  • Pure 1M context maxing: still symmetric turbo3, only thing that fits.
  • Short interactive (under 32K): f16 if memory allows, else q8_0. Quality cost is genuinely zero.
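
For concreteness, a minimal launch sketch for the coding-agent pick, assuming the fork keeps stock llama-server flags; the model path and the 512K context value are illustrative:

```
# q8_0 keys + turbo4 values at deep context. turbo4 is fork-specific;
# -fa (flash attention) is required for a quantized V cache.
./llama-server -m qwen3.6-35b-a3b-q8_0.gguf \
  -c 524288 -fa -ctk q8_0 -ctv turbo4
```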

Caveats

  • PPL was measured at 4096 context. Quality at deeper contexts, where the cache is more saturated, might tell a different story.
  • Asymmetric quality numbers are still pending. The throughput data argues V-side compression is cheap, but I haven't measured KL or PPL on the asym combos yet.
  • f16 K + turbo* V hits a kernel fallback on this fork on Metal. Verify before assuming the combo works on other backends.
  • Single hardware data point (M5 Max, 128 GB). Crossover depths and the prefill/decode split likely shift with memory bandwidth and GPU core count.

Still in flight

  • u/GCoderDCoder. Aider Polyglot pass for f16, turbo3, and turbo4 (q8_0 scored 62.2% earlier this week, n=225). Each Polyglot run takes about 6 to 12 hours, so it's a few nights run serially. Starting later this week.
  • u/noctrex. Wider quant types (q4_0, q4_1, iq4_nl, q5_0, q5_1) extending the depth sweep. After Aider.
  • u/Able_Librarian1569. Same sweep on a non-MoE non-DeltaNet model for transferability. After the wider quant types.

Same offer as yesterday. If you have non-M5-Max Apple Silicon and want to run a slice of this matrix, drop your numbers below or DM me. Happy to send the raw llama-bench and llama-perplexity output for anyone who wants to dig into the per-cell stats.

Full writeup with the methodology and the per-cell stderr numbers: https://llmkube.com/blog/turboquant-m5-max-quality-and-asymmetric
