So I'm running the setup below, and I've seen people run the same setup with TurboQuant_plus and get 35 tokens/second. I find the speeds I'm getting acceptable, but if I could hit 30-35 I'd be soooooo happy. Any advice on the configs?
Okay, I'm running two variants of llama.cpp: the standard build and TheTom's TurboQuant_plus, both with Qwen3.6-35B-A3B-UD-IQ4_XS.
Hardware: MSI Stealth 13v - i7-13620H (10 Core / 16 thread with 6 P-cores) - 64GB 5200 - 4TB NVMe
These are the configs I'm using (with a rough llama-server command sketch after each):
[1] Qwen 3.6 35B MoE ───────────────────────────────
Model: Qwen3.6-35B-A3B-UD-IQ4_XS.gguf
Context: 40,960 tokens
GPU: NGL 99 — hybrid MoE (35 expert layers in RAM, rest on GPU)
K cache: q8_0 (protected — Qwen arch is K-sensitive)
V cache: q4_0 (V compression lossless per asymmetric KV paper)
Flash: on | Batch: -b 2048 -ub 2048
Extras: --reasoning-budget 4096 | -np 1 | --cache-ram 0
LLAMA_CHAT_TEMPLATE_KWARGS={"preserve_thinking":true}
Speed: ~25 t/s simple / ~17 t/s heavy thinking | VRAM: ~7.0 GB
Use: OpenCode default, speed-priority tasks
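In case it reads better as an actual command, here's roughly what config [1] boils down to as a llama-server line. This is reconstructed from my notes rather than copy-pasted, so double-check flag spellings against your build (I believe the hybrid expert split maps to --n-cpu-moe 35, and newer builds want "-fa on" where older ones take a bare "-fa"):

LLAMA_CHAT_TEMPLATE_KWARGS='{"preserve_thinking":true}' \
llama-server \
  -m Qwen3.6-35B-A3B-UD-IQ4_XS.gguf \
  -c 40960 -ngl 99 --n-cpu-moe 35 \
  -ctk q8_0 -ctv q4_0 \
  -fa on -b 2048 -ub 2048 \
  --reasoning-budget 4096 -np 1 --cache-ram 0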
[2] Qwen 3.6 35B MoE ───────────────────────────────
Model: Qwen3.6-35B-A3B-UD-IQ4_XS.gguf
Context: 196,608 tokens ← confirmed 6.8 GB VRAM at this size
GPU: NGL 99 — full CPU MoE (-cmoe, all 256 experts in RAM)
K cache: q8_0 (protected)
V cache: turbo3 (3.125 bpv; a partial CPU/GPU expert split produces "///" garbage output with turbo, full CPU MoE is stable)
Flash: on | Batch: -b 2048 -ub 2048
Extras: --reasoning-budget 4096 | -np 1 | --cache-ram 0
Speed: ~19-21 t/s | VRAM: 6.8 GB
Quality: Indistinguishable from Non-Quant on tested tasks
Use: Long-context work, when VRAM headroom needed
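And the rough equivalent for config [2] (again a sketch from my notes; -ctv turbo3 is the TurboQuant_plus fork's V-cache type, so a mainline llama.cpp build will reject that value):

llama-server \
  -m Qwen3.6-35B-A3B-UD-IQ4_XS.gguf \
  -c 196608 -ngl 99 -cmoe \
  -ctk q8_0 -ctv turbo3 \
  -fa on -b 2048 -ub 2048 \
  --reasoning-budget 4096 -np 1 --cache-ram 0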
I gave the same prompt to each: a somewhat complicated math problem, plus an instruction to write a Python class estimator for a specific task in commercial construction.
Then I compared the results and ran the code through Claude Code.
- Standard (Non-Quant) took 5min 41s at 17.55 t/s and wrote 166 lines of code.
- The TurboQuant_plus version took 4min 35s at 19.43 t/s and wrote 104 lines of code.
┌──────────────────┬─────────────────┬────────────┐
│                  │ Standard        │ TurboQuant │
├──────────────────┼─────────────────┼────────────┤
│ VRAM │ 7.0 GB │ 6.8 GB │
├──────────────────┼─────────────────┼────────────┤
│ Context │ 40k │ 192k │
├──────────────────┼─────────────────┼────────────┤
│ Tokens generated │ 5,988 │ 5,359 │
├──────────────────┼─────────────────┼────────────┤
│ Time │ 5min 41s │ 4min 35s │
├──────────────────┼─────────────────┼────────────┤
│ t/s │ 17.55 │ 19.43 │
└──────────────────┴─────────────────┴────────────┘
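For anyone checking the throughput math, the t/s figures are just tokens generated ÷ wall-clock seconds; the tiny gap vs the table comes from the times being rounded:

echo "scale=2; 5988/341" | bc   # 5min 41s = 341 s → ~17.56 t/s
echo "scale=2; 5359/275" | bc   # 4min 35s = 275 s → ~19.49 t/s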
Both sets of code came back from Claude Code as perfectly acceptable, but the TurboQuant code was 2-3% more accurate. That doesn't sound like a lot, but in this case it came down to how a specific fastener quantity is calculated, and that could get expensive IRL. If I'm being totally honest, it's an extremely small error, but it's still there.
So not only did TurboQuant give me roughly 20% faster results, the results were as accurate as or better than the standard version, AND I get a 192K context window. For reference, I ran it at 262k, but it hits 7.8 GB VRAM and that's too close to the edge for me.
Overall it's perfectly acceptable for my hardware, but if there's any way to squeeze out more tokens/second, I'd love to hear it. I'm relatively new to llama.cpp; I've been using Ollama and LM Studio for the most part.