Qwen 3.6 dropped yesterday and I wanted to see if hybrid offloading actually earns its keep on this hardware. My box has two RTX 5060 Ti cards (32GB VRAM total) and 64GB of system RAM. Not a workstation card in sight.
I ran the same bench harness across three configs back to back, so the comparison is at least fair on the hardware side. Stock ghcr.io/ggml-org/llama.cpp:server-cuda13 for the MoE runs, our TurboQuant build for the dense run. Sequential: 10 iterations, 128 max tokens, 2 warmup runs. Stress: 4 concurrent workers, 256 max tokens, 5 minutes. Same prompt for every config.
The MoE flags:
--cpu-moe --no-kv-offload --cache-type-k q8_0 --cache-type-v q8_0 --ctx-size 90112 --flash-attn on --n-gpu-layers 99 --split-mode layer --tensor-split 1,1
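For reference, here is roughly how those flags assemble into a full launch. The flags themselves are the ones above; the model filename, mount path, and port are placeholders, not my exact setup:

```shell
# Hypothetical full invocation; model path, mount, and port are placeholders.
docker run --rm --gpus all \
  -v /path/to/models:/models \
  -p 8080:8080 \
  ghcr.io/ggml-org/llama.cpp:server-cuda13 \
  --model /models/qwen3.6-35b-a3b-q4_k_m.gguf \
  --host 0.0.0.0 --port 8080 \
  --cpu-moe --no-kv-offload \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --ctx-size 90112 --flash-attn on \
  --n-gpu-layers 99 --split-mode layer --tensor-split 1,1
```

--cpu-moe keeps the expert tensors in system RAM while --n-gpu-layers 99 pushes everything else (attention, shared weights) onto the two GPUs, split evenly by --tensor-split 1,1.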
Results:
| Model / Config | Generation | P50 latency | Stress (4 concurrent) |
|---|---|---|---|
| Qwen 3.5-27B dense (full GPU, TurboQuant KV) | 18.3 tok/s | 7,196 ms | 10.4 tok/s, 52 req/5min |
| Qwen 3-Coder-30B-A3B (--cpu-moe hybrid) | 31.1 tok/s | 2,286 ms | 12.0 tok/s, 113 req/5min |
| Qwen 3.6-35B-A3B (--cpu-moe hybrid) | 21.7 tok/s | 6,160 ms | 6.8 tok/s, 38 req/5min |
A few things I did not expect.
The jump from dense 3.5 to Coder hybrid is basically free performance if you have a MoE model. 70% faster generation on the same two GPUs, with P50 latency cut to a third. I knew hybrid offloading looked good on paper, but seeing the raw numbers side by side made me wish I had tried it sooner.
Qwen 3.6 is slower than the Coder variant even though both are 3B active. The extra 5B of total params means more expert weight traffic through system RAM per token. But the quality delta is not subtle: 73.4% vs 50.3% on SWE-bench Verified, and +11 points on Terminal-Bench 2.0. For anything agentic or multi-step I am grabbing 3.6; for fast code completion the Coder is still the move.
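To put a rough ceiling on that expert traffic: a back-of-envelope estimate, where all three inputs are my assumptions (not measurements): ~3B active params per token, ~4.5 bits/weight for a q4_k_m-class quant, ~80 GB/s effective system RAM read bandwidth.

```shell
# Back-of-envelope ceiling for CPU-side expert streaming.
# All three inputs are assumptions, not measured values.
active_params=3e9      # A3B: ~3B params touched per token
bits_per_weight=4.5    # rough q4_k_m average
ram_bw_gbps=80         # effective DDR5 read bandwidth
awk -v p="$active_params" -v b="$bits_per_weight" -v bw="$ram_bw_gbps" 'BEGIN {
  gb_per_tok = p * b / 8 / 1e9          # weight bytes read per generated token
  printf "%.2f GB/token -> ~%.0f tok/s ceiling\n", gb_per_tok, bw / gb_per_tok
}'
# -> 1.69 GB/token -> ~47 tok/s ceiling
```

The observed 21-31 tok/s sits plausibly under that ~47 tok/s ceiling; the gap covers GPU attention work and sub-peak RAM bandwidth.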
Dense wins prompt processing by a mile: 160 tok/s vs 30-95 tok/s for the hybrid runs. If you live in long-context RAG or heavy prompt ingestion, that is not going away. Generation speed is where hybrid pulls ahead, because the PCIe round trip only happens for the active experts.
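That prompt-processing gap compounds at long context. A quick illustration of ingesting a full 90,112-token window (the rates are the ones measured above; a full-window prompt is just the worst case, not something I benchmarked):

```shell
# Time to ingest a full 90112-token prompt at each measured PP rate.
ctx=90112
for rate in 160 95 30; do
  awk -v c="$ctx" -v r="$rate" \
    'BEGIN { printf "%d tok/s: %.0f s to ingest %d tokens\n", r, c/r, c }'
done
# -> 160 tok/s: 563 s, 95 tok/s: 949 s, 30 tok/s: 3004 s
```

Roughly 9 minutes vs up to 50 minutes to chew through the same prompt, which is why dense still owns heavy-ingestion workloads.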
Tried pushing further. Wanted to combine --cpu-moe with our TurboQuant KV cache build (tbqp3/tbq3) to get to 131K context with a much smaller KV footprint. Crashed on warmup with exit code 139 (SIGSEGV). The stack pointed at the fused Gated Delta Net kernels in the TurboQuant fork; looks like that optimization path has not been updated for the Qwen 3 MoE architecture yet. Stock llama.cpp with q8_0 at 90K is fine for now.
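For a sense of why the KV cache format matters at this context length, here is a rough footprint estimate. The architecture numbers (48 layers, 4 KV heads, head_dim 128) are my assumptions for the 30B-A3B class, and q8_0 is treated as ~1 byte/element, ignoring block-scale overhead, so treat this as illustrative only:

```shell
# Rough KV cache size: 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes/elem.
# Architecture numbers are assumed, and q8_0 overhead is ignored.
layers=48; kv_heads=4; head_dim=128; ctx=90112
for fmt in "f16 2" "q8_0 1"; do
  set -- $fmt
  awk -v L="$layers" -v h="$kv_heads" -v d="$head_dim" -v c="$ctx" \
      -v name="$1" -v bytes="$2" \
      'BEGIN { printf "%s: %.1f GB\n", name, 2*L*h*d*c*bytes/1e9 }'
done
# -> f16: 8.9 GB, q8_0: 4.4 GB
```

With --no-kv-offload that cache lives in the 64GB of system RAM, so q8_0 roughly halving it is what made 90K comfortable, and a tighter TurboQuant format at 131K tempting.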
What I actually used it for once it was running: fed it the spec doc for the next feature of the K8s operator I wrote to deploy it, then let it rip overnight. 56 tool calls, 100% success rate, 9 unit tests, all verification commands green. A merge-ready PR when I woke up. The model I deployed ended up shipping the operator's next feature; bit of a recursion moment. Full writeup here if you want the longer version.
Happy to share more of the config, the bench harness, or the raw numbers if anyone wants them.