Qwen 3.6-35B-A3B KV cache part 2: PPL, KL divergence, asymmetric K/V, 64K row on M5 Max

Reddit r/LocalLLaMA / 4/30/2026


Key Points

  • The post reports overnight follow-up KV-cache experiments for Qwen 3.6-35B-A3B on an M5 Max, measuring perplexity (PPL), KL divergence vs f16, and top-1 token agreement on wikitext-2 at a 4K context length.
  • Using q8_0 KV shows essentially no quality regression at 4K (PPL change within stderr, very small KL, and 98.6% top-1 agreement), while turbo4 and turbo3 introduce progressively larger quality/consistency losses.
  • Turbo3 causes about a 1% PPL increase and ~5 percentage points lower top-token agreement compared with q8_0, and turbo4 lands in between, consistent with their compression levels.
  • A depth sweep of asymmetric K/V combos (q8_0 K with turbo4 V, q8_0 K with turbo3 V, f16 K with turbo4 V) shows -ctk q8_0 -ctv turbo4 as the standout: it matches symmetric q8_0 throughput at 256K and fits a 512K context where symmetric q8_0 OOMs.
  • The author notes throughput gains for asymmetric schemes but calls for additional PPL runs on the asymmetric K/V setups to fully validate quality tradeoffs beyond KL/logit comparisons.

Follow-up to yesterday's post: https://www.reddit.com/r/LocalLLaMA/comments/1sy7srk/. Comments asked for perplexity, KL divergence, asymmetric K/V combos, and a 64K data point. Ran them overnight. Same M5 Max, same Qwen 3.6-35B-A3B Q8, same TheTom TurboQuant fork (feature/turboquant-kv-cache).

Quality (perplexity + KL divergence on wikitext-2)

For u/milpster and u/Karyo_Ten. Context size 4096, since the canonical 512 doesn't fill enough KV cache to surface cache-quantization effects. f16 saves the baseline logits via --kl-divergence-base, then each quant run computes KL against that.
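
For anyone reproducing this, the flow looks roughly like the sketch below. It assumes the fork keeps stock llama-perplexity behavior; the model and dataset paths are placeholders, and the turbo* cache types only exist on the fork.

```
# Save baseline logits from the f16-cache run, then score each quantized
# cache against them. -fa is required for a quantized V cache in llama.cpp.
MODEL=qwen3.6-35b-a3b-q8_0.gguf         # placeholder path
WIKI=wikitext-2-raw/wiki.test.raw       # placeholder path

./llama-perplexity -m "$MODEL" -f "$WIKI" -c 4096 -fa \
  -ctk f16 -ctv f16 --kl-divergence-base base-logits.bin

for kv in q8_0 turbo3 turbo4; do        # turbo* exist only on the fork
  ./llama-perplexity -m "$MODEL" -f "$WIKI" -c 4096 -fa \
    -ctk "$kv" -ctv "$kv" \
    --kl-divergence-base base-logits.bin --kl-divergence
done
```

The --kl-divergence run reports the same-top-token statistic alongside KL, which is where the agreement column comes from.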

Cache             PPL      KL vs f16   Top-1 token agreement
f16               5.7438   baseline    n/a
q8_0              5.7433   0.0016      98.64%
turbo3 (~4.9x)    5.8092   0.0199      93.93%
turbo4 (~3.8x)    5.7810   0.0131      95.28%

q8_0 KV is essentially free at this depth. The PPL delta is -0.0005, well inside the ±0.036 stderr, and KL is 0.0016. The quantized cache picks the same top-1 token as f16 98.6% of the time. The worry in yesterday's comments was "what does this cost in quality." At 4K context, it's noise.

turbo3 costs about a 1% PPL increase and 5 percentage points of top-token agreement, with KL roughly 12x q8_0's. turbo4 sits in between, in line with its lower compression ratio. Quality cost scales with compression; no surprises.

Asymmetric K/V (depth sweep)

For u/Sabin_Stargem and my own untested caveat from yesterday. Decode tok/s, same llama-bench flags as the symmetric sweep (invocation sketch after the table):

Depth   q8_0 K / turbo4 V   q8_0 K / turbo3 V   f16 K / turbo4 V
0       82.9                81.8                72.8
8K      75.4                75.6                16.9
32K     66.0                63.2                8.6
128K    41.0                38.2                2.8
256K    27.1                25.0                skipped
512K    16.5                14.8                skipped
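
The invocation shape, for reference. This is a sketch assuming stock llama-bench flags (the exact flags are in yesterday's post), with the model path as a placeholder:

```
# One asymmetric column of the sweep. -d sets the KV depth to fill before
# measuring; -ctk/-ctv set the K and V cache types independently.
./llama-bench -m qwen3.6-35b-a3b-q8_0.gguf -fa 1 \
  -ctk q8_0 -ctv turbo4 -p 512 -n 128 \
  -d 0,8192,32768,131072,262144,524288
```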

-ctk q8_0 -ctv turbo4 is the standout. At 256K it matches yesterday's symmetric q8_0 throughput (pp 128 vs 124, tg 27.1 vs 26.6), and it fits 512K where symmetric q8_0 OOM'd. So you get q8_0-grade prefill behavior with a turbo4-grade context ceiling. Sabin's hypothesis that V compresses cheap and K compresses expensive looks right on the throughput side. On the quality side, I'd want a PPL run on the asym combos to fully close the loop.

-ctk q8_0 -ctv turbo3 does the same trick but with worse decode. Tighter V quant taxes the generation side more.

-ctk f16 -ctv turbo4 is effectively broken on this fork on Metal. The FlashAttention kernel doesn't fast-path that K/V type combination, so it falls back to a generic dequantize-then-attend path. At 8K it's 34x slower than symmetric f16; at 128K it's 78x slower (4.1 t/s pp). Cells past 128K weren't worth completing. Don't use this combo.

64K row

For u/ocarina24. Filling the 32K to 128K gap on the prefill curve. All seven configs at depth 65536:

Cache                pp512 (t/s)   tg128 (t/s)
f16 (symmetric)      602.0         59.8
q8_0 (symmetric)     479.2         57.9
turbo3 (symmetric)   469.8         49.9
turbo4 (symmetric)   418.0         55.2
q8_0 K / turbo4 V    468.2         55.9
q8_0 K / turbo3 V    465.6         52.6
f16 K / turbo4 V     8.3           4.9

Two things stood out. First, the prefill curves are nearly converged at 64K. turbo3 (470) is within 2% of q8_0 (479). Yesterday's data showed turbo3 actually pulling ahead by 128K (253 vs 245), so the bandwidth-bound regime kicks in somewhere between 64K and 128K on this hardware. Earlier than I'd estimated. Second, the asymmetric q8_0/turbo* rows track symmetric q8_0 prefill closely at this depth too. Same story as the deeper rows.

What I take from all of this

Updated cache-type recommendations from yesterday (a launch sketch for the top pick follows the list):

  • Coding agents (deep context, lots of generated tokens): -ctk q8_0 -ctv turbo4 is the new pick. q8_0 quality on K, turbo4 savings on V, fits 512K.
  • RAG or batch QA (heavy prefill, short answers): same combo, or symmetric turbo3 at the deepest depths.
  • Pure 1M context maxing: still symmetric turbo3, only thing that fits.
  • Short interactive (under 32K): f16 if memory allows, else q8_0. Quality cost is genuinely zero.
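
For concreteness, a minimal launch sketch for the coding-agent pick, assuming the fork keeps stock llama-server flags; the model path and the 512K context value are illustrative:

```
# q8_0 keys + turbo4 values at deep context. turbo4 is fork-specific;
# -fa (flash attention) is required for a quantized V cache.
./llama-server -m qwen3.6-35b-a3b-q8_0.gguf \
  -c 524288 -fa -ctk q8_0 -ctv turbo4
```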

Caveats

  • PPL was measured at 4096 context. Quality at deeper contexts, where the cache is more saturated, might tell a different story.
  • Asymmetric quality numbers are still pending. The throughput data argues V-side compression is cheap, but I haven't measured KL or PPL on the asym combos yet.
  • f16 K + turbo* V hits a kernel fallback on this fork on Metal. Verify before assuming the combo works on other backends.
  • Single hardware data point (M5 Max, 128 GB). Crossover depths and the prefill/decode split likely shift with memory bandwidth and GPU core count.

Still in flight

  • u/GCoderDCoder. Aider Polyglot pass for f16, turbo3, and turbo4 (q8_0 scored 62.2% earlier this week, n=225). Each Polyglot run takes about 6 to 12 hours, so it's a few nights run serially. Starting later this week.
  • u/noctrex. Wider quant types (q4_0, q4_1, iq4_nl, q5_0, q5_1) extending the depth sweep. After Aider.
  • u/Able_Librarian1569. Same sweep on a non-MoE non-DeltaNet model for transferability. After the wider quant types.

Same offer as yesterday. If you have non-M5-Max Apple Silicon and want to run a slice of this matrix, drop your numbers below or DM me. Happy to send the raw llama-bench and llama-perplexity output for anyone who wants to dig into the per-cell stats.

Full writeup with the methodology and the per-cell stderr numbers: https://llmkube.com/blog/turboquant-m5-max-quality-and-asymmetric
