So I'm running the setup below, and I've seen people run the same setup with TurboQuant_plus and get 35 tokens/second. I find the speeds I'm getting acceptable, but if I could hit 30-35 I'd be soooooo happy. Any advice on the configs?
Okay, I'm running two variants of llama.cpp: the standard build and TheTom's TurboQuant_plus, both with Qwen3.6-35B-A3B-UD-IQ4_XS.
Hardware: MSI Stealth 13v - i7-13620H (10 Core / 16 thread with 6 P-cores) - 64GB 5200 - 4TB NVMe
These are the configs I'm using (with a rough llama-server command sketch after each):
[1] Qwen 3.6 35B MoE ───────────────────────────────
Model: Qwen3.6-35B-A3B-UD-IQ4_XS.gguf
Context: 40,960 tokens
GPU: NGL 99 — hybrid MoE (35 expert layers in RAM, rest on GPU)
K cache: q8_0 (protected — Qwen arch is K-sensitive)
V cache: q4_0 (V compression lossless per asymmetric KV paper)
Flash: on | Batch: -b 2048 -ub 2048
Extras: --reasoning-budget 4096 | -np 1 | --cache-ram 0
LLAMA_CHAT_TEMPLATE_KWARGS={"preserve_thinking":true}
Speed: ~25 t/s simple / ~17 t/s heavy thinking | VRAM: ~7.0 GB
Use: OpenCode default, speed-priority tasks
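In case it reads better as an actual command, here's roughly what config [1] boils down to as a llama-server line. This is reconstructed from my notes rather than copy-pasted, so double-check flag spellings against your build (I believe the hybrid expert split maps to --n-cpu-moe 35, and newer builds want "-fa on" where older ones take a bare "-fa"):

LLAMA_CHAT_TEMPLATE_KWARGS='{"preserve_thinking":true}' \
llama-server \
  -m Qwen3.6-35B-A3B-UD-IQ4_XS.gguf \
  -c 40960 -ngl 99 --n-cpu-moe 35 \
  -ctk q8_0 -ctv q4_0 \
  -fa on -b 2048 -ub 2048 \
  --reasoning-budget 4096 -np 1 --cache-ram 0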
[2] Qwen 3.6 35B MoE ───────────────────────────────
Model: Qwen3.6-35B-A3B-UD-IQ4_XS.gguf
Context: 196,608 tokens ← confirmed 6.8 GB VRAM at this size
GPU: NGL 99 — full CPU MoE (-cmoe, all 256 experts in RAM)
K cache: q8_0 (protected)
V cache: turbo3 (3.125 bpv; a partial CPU/GPU expert split produces "///" garbage output with turbo, full CPU MoE is stable)
Flash: on | Batch: -b 2048 -ub 2048
Extras: --reasoning-budget 4096 | -np 1 | --cache-ram 0
Speed: ~19-21 t/s | VRAM: 6.8 GB
Quality: Indistinguishable from Non-Quant on tested tasks
Use: Long-context work, when VRAM headroom needed
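And the rough equivalent for config [2] (again a sketch from my notes; -ctv turbo3 is the TurboQuant_plus fork's V-cache type, so a mainline llama.cpp build will reject that value):

llama-server \
  -m Qwen3.6-35B-A3B-UD-IQ4_XS.gguf \
  -c 196608 -ngl 99 -cmoe \
  -ctk q8_0 -ctv turbo3 \
  -fa on -b 2048 -ub 2048 \
  --reasoning-budget 4096 -np 1 --cache-ram 0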
I gave the same prompt to each: a somewhat complicated math problem, plus an instruction to write a Python class estimator for a specific task in commercial construction.
Then I compared the results and ran the code through Claude Code.
- Standard (Non-Quant) took 5min 41s at 17.55 t/s and wrote 166 lines of code.
- The TurboQuant_plus version took 4min 35s at 19.43 t/s and wrote 104 lines of code.
┌──────────────────┬─────────────────┬────────────┐
│                  │ Standard        │ TurboQuant │
├──────────────────┼─────────────────┼────────────┤
│ VRAM │ 7.0 GB │ 6.8 GB │
├──────────────────┼─────────────────┼────────────┤
│ Context │ 40k │ 192k │
├──────────────────┼─────────────────┼────────────┤
│ Tokens generated │ 5,988 │ 5,359 │
├──────────────────┼─────────────────┼────────────┤
│ Time │ 5min 41s │ 4min 35s │
├──────────────────┼─────────────────┼────────────┤
│ t/s │ 17.55 │ 19.43 │
└──────────────────┴─────────────────┴────────────┘
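For anyone checking the throughput math, the t/s figures are just tokens generated ÷ wall-clock seconds; the tiny gap vs the table comes from the times being rounded:

echo "scale=2; 5988/341" | bc   # 5min 41s = 341 s → ~17.56 t/s
echo "scale=2; 5359/275" | bc   # 4min 35s = 275 s → ~19.49 t/s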
Both sets of code came back from Claude Code as perfectly acceptable, but the TurboQuant code was 2-3% more accurate. That doesn't sound like a lot, but in this case it came down to how a specific fastener quantity is calculated, and that could get expensive IRL. If I'm being totally honest, it's an extremely small error, but it's still there.
So not only did TurboQuant give me roughly 20% faster results, the results were as accurate as or better than the standard version, AND I get a 192K context window. For reference, I ran it at 262k, but it hits 7.8 GB VRAM and that's too close to the edge for me.
Overall it's perfectly acceptable for my hardware, but if there's any way to squeeze out more tokens/second, I'd love to hear it. I'm relatively new to llama.cpp; I've been using Ollama and LM Studio for the most part.