Spent an evening dialing in Qwen3.6-35B-A3B on consumer hardware. Fun side note: I had Claude Opus 4.7 (just the $20 sub) build the config, launch the servers in the background, run the benchmarks, read the VRAM splits from the llama.cpp logs, and iterate on the tuning — basically did the whole thing autonomously. I just told it what hardware I have and what I wanted to run.
Sharing because the common --cpu-moe advice leaves a ~54% speedup on the table on 16GB GPUs.
Hardware
- GPU: RTX 5070 Ti (16GB GDDR7, Blackwell)
- CPU: Ryzen 9800X3D (96MB L3 V-Cache)
- RAM: 32GB DDR5
- Stack: llama.cpp b8829 (CUDA 13.1, Windows x64)
- Model: unsloth/Qwen3.6-35B-A3B-GGUF, UD-Q4_K_M (22.1 GB)
The finding — --cpu-moe vs --n-cpu-moe N
Everyone's using --cpu-moe, which pushes ALL MoE experts to CPU. On a 16GB GPU with a 22GB MoE model, that means only ~1.9 GB of VRAM holds weights; the other ~12 GB sits idle.
--n-cpu-moe N keeps experts of the first N layers on CPU and puts the rest on GPU. With N=20 on a 40-layer model, the split uses VRAM properly.
Benchmarks (300-token generation, Q4_K_M)
| Config | Gen t/s | Prompt t/s | VRAM used |
|---|---|---|---|
| --cpu-moe (baseline) | 51.2 | 87.9 | 3.5 GB |
| --n-cpu-moe 20 | 78.7 | 100.6 | 12.7 GB |
| --n-cpu-moe 20 + -np 1 + 128K ctx | 79.3 | 135.8 | 13.2 GB |
+54% generation speed and +54% prompt speed vs. the naive --cpu-moe baseline. Jumping to 128K context is essentially free, since -np 1 drops the extra recurrent-state slots.
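The percentages are just ratios of the measured rates; a quick sanity check, with the numbers taken from the table above:

```python
def speedup_pct(new_rate: float, baseline_rate: float) -> int:
    """Percent improvement of new_rate over baseline_rate, rounded."""
    return round((new_rate / baseline_rate - 1) * 100)

# Generation: --n-cpu-moe 20 vs --cpu-moe baseline
print(speedup_pct(78.7, 51.2))   # 54
# Prompt processing: full config vs baseline
print(speedup_pct(135.8, 87.9))  # 54
```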
Startup command that works
llama-server.exe ^
  -m "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" ^
  --n-cpu-moe 20 ^
  -ngl 99 ^
  -np 1 ^
  -fa on ^
  -ctk q8_0 -ctv q8_0 ^
  -c 131072 ^
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 ^
  --presence-penalty 0.0 --repeat-penalty 1.0 ^
  --reasoning-budget -1 ^
  --host 0.0.0.0 --port 8080

That's Unsloth's "Precise Coding" sampling preset. For general use: --temp 1.0 --presence-penalty 1.5.
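If you flip between presets a lot, a tiny helper that emits just the sampler flags keeps the rest of the command stable. The numbers below are the two presets from this post; the helper itself (including its names) is just my own convenience sketch, not part of llama.cpp:

```python
# Hypothetical helper: build llama-server sampler flags for the two
# presets mentioned above. Only the numeric values come from the post.
PRESETS = {
    "precise-coding": {"--temp": 0.6, "--top-p": 0.95, "--top-k": 20,
                       "--min-p": 0.0, "--presence-penalty": 0.0,
                       "--repeat-penalty": 1.0},
    "general": {"--temp": 1.0, "--presence-penalty": 1.5},
}

def sampler_flags(preset: str) -> str:
    """Render one preset as a flag string to paste into the command line."""
    return " ".join(f"{flag} {value}" for flag, value in PRESETS[preset].items())

print(sampler_flags("general"))  # --temp 1.0 --presence-penalty 1.5
```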
Gotchas I hit (well, that Opus hit and fixed)
- -np defaults to auto = 4 slots, which wastes memory on recurrent state (~190 MB). Set -np 1 for single-user setups (OpenCode etc.).
- --fit-target doesn't help here; -ngl 99 + --n-cpu-moe N already gives you deterministic control.
- -ctk q8_0 -ctv q8_0 is nearly lossless and halves your KV cache vs fp16. 128K ctx only costs 1.36 GB of VRAM.
- Qwen3.6 is a hybrid architecture: only 10 layers are standard attention, the other 40 are Gated Delta Net (recurrent). That's why KV memory is so small.
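The ~1.36 GB KV figure is easy to ballpark, since only the standard-attention layers allocate a per-token cache. A minimal sketch: the per-layer KV width (4 KV heads × 128 head dim) is my assumption, not something from the logs, and q8_0 stores roughly 1.0625 bytes per element (1 byte plus block scales):

```python
def kv_cache_gib(ctx, attn_layers=10, kv_heads=4, head_dim=128,
                 bytes_per_elt=1.0625):  # q8_0: 34 bytes per 32-element block
    # K and V caches, only for the standard-attention layers; the
    # Gated Delta Net layers keep a small fixed recurrent state instead.
    elems = ctx * attn_layers * 2 * kv_heads * head_dim
    return elems * bytes_per_elt / 2**30

print(round(kv_cache_gib(131072), 2))  # 1.33 -- close to the logged 1.36 GB
```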
How to tune N for your GPU
Each MoE layer on GPU costs ~530 MB VRAM. Non-MoE weights are ~1.9 GB fixed. For a 40-layer model:
| GPU VRAM | Recommended N |
|---|---|
| 8 GB | stay with --cpu-moe |
| 12 GB | N=26 |
| 16 GB | N=20 (sweet spot) |
| 24 GB | N=8 (fits almost everything) |
Start conservative, watch VRAM during a long-context generation, then step N down by 2-3 until you have ~2 GB headroom.
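That rule of thumb folds into a few lines of arithmetic. The constants (1.9 GB fixed, ~0.53 GB per MoE layer on GPU, ~1.4 GB KV at 128K, 2 GB headroom) come from the measurements above, but treat the output as a first guess to refine, not gospel:

```python
def recommend_n(vram_gb, layers=40, fixed=1.9, per_layer=0.53,
                kv=1.4, headroom=2.0):
    """First guess for --n-cpu-moe N (experts of the first N layers stay on CPU)."""
    budget = vram_gb - fixed - kv - headroom      # VRAM left for expert tensors
    gpu_layers = max(0, min(layers, int(budget / per_layer)))
    return layers - gpu_layers                    # N == layers means full --cpu-moe

print(recommend_n(16))  # 20 -- matches the sweet spot found above
```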
TL;DR
Replace --cpu-moe with --n-cpu-moe 20, add -np 1, and you get 79 t/s + 128K context on a 5070 Ti. The 9800X3D’s V-Cache carries the CPU side effortlessly.
And Claude Opus 4.7 on the $20 Pro sub is genuinely good enough now to run this kind of hardware-tuning loop end-to-end — launch servers in background, parse logs, iterate — without hand-holding. Kind of wild.
Happy to test other configs if anyone wants comparisons.