Qwen 35B-A3B is very usable with 12GB of VRAM

Reddit r/LocalLLaMA / 5/9/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • Qwen3.6's 35B MoE model "Qwen 35B-A3B" delivers usable inference performance even on an RTX 3060 with 12GB of VRAM, and the author argues that 12GB is a realistic size for it.
  • Because -ncmoe controls how much of the MoE stays resident on the GPU, tuning it carefully (without lowering it too far) keeps prefill (prompt processing) fast while still leaving room for useful context lengths such as 16k/32k.
  • The author relies on llama-bench's pp512 measurements; with the chosen profile (e.g. -ncmoe 18, q8_0 KV cache) prefill is shown to be very fast.
  • For daily coding, a profile that balances 32k context with fast generation is recommended (-c 32768, -ngl 999, -ncmoe 20, etc.); 16k is slightly faster but sits close to the VRAM limit.
  • The post also presents an -ncmoe sweep for plain decoding, a KV-cache quantization sweep (q8_0 recommended), and tests on llama.cpp's MTP (speculative decoding) branch.

Hardware:

RTX 3060 12GB
32GB DDR4-3200
Windows
CUDA 13.x

Model:

Qwen3.6-35B-A3B-MTP-IQ4_XS.gguf 

The model is a 35B MoE, so -ncmoe matters a lot. Lower -ncmoe means more MoE blocks stay on GPU.
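As a rough illustration (not a command from the post), the flag pairs with -ngl like this; the -ncmoe spelling follows the post, and recent llama.cpp builds document the same option as --n-cpu-moe:

:: Hedged sketch, not from the post: request all layers on GPU (-ngl 999),
:: but keep the MoE expert weights of the first 20 layers on the CPU (-ncmoe 20).
:: Lowering 20 toward 16 moves more expert weights onto the GPU and uses more VRAM.
llama-cli.exe -m "Qwen3.6-35B-A3B-MTP-IQ4_XS.gguf" -ngl 999 -ncmoe 20 -p "..." -n 128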

Main takeaway

12GB VRAM feels like a very practical size for this model. It lets you keep enough MoE blocks on GPU that plain decoding becomes quite strong, while still leaving room for useful context sizes like 16k/32k.

For prompt processing / prefill, I trust the llama-bench numbers more than llama-cli’s interactive Prompt: line, because llama-bench gives a cleaner pp512 measurement.

Best plain llama-bench result:

-ncmoe 18 -t 9 -ctk q8_0 -ctv q8_0
pp512: ~914 t/s
tg128: ~46.8 t/s

So raw prefill is very fast on this setup.
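The post does not show the full llama-bench invocation; a minimal reconstruction matching the flags above (assuming a llama-bench build that accepts -ncmoe, and with flash attention enabled since a quantized V cache generally requires it; the -fa spelling, 1 vs on, varies by version) would be:

:: Sketch of a pp512/tg128 run (llama-bench's default tests) with the post's flags.
llama-bench.exe -m "Qwen3.6-35B-A3B-MTP-IQ4_XS.gguf" -ngl 999 -ncmoe 18 -t 9 -fa 1 -ctk q8_0 -ctv q8_0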

Best practical coding profile

For daily coding, I would use this:

llama-cli.exe ^
  -m "Qwen3.6-35B-A3B-MTP-IQ4_XS.gguf" ^
  -p "..." ^
  -n 512 ^
  -c 32768 ^
  --temp 0 --top-k 1 ^
  -ngl 999 -ncmoe 20 ^
  -fa on ^
  -ctk q8_0 -ctv q8_0 ^
  --no-mmap ^
  --no-jinja ^
  -t 9 ^
  --perf

Result:

Context: 32k
Prompt: ~88.9 t/s in llama-cli
Generation: ~43.4 t/s
VRAM free: ~273 MiB

This is a nice balance: large enough context for coding, still fast, and not completely out of VRAM.

Faster 16k profile

-c 16384 -ncmoe 19 -ctk q8_0 -ctv q8_0 -t 9 

Result:

Prompt: ~91.5 t/s in llama-cli
Generation: ~44.5 t/s
VRAM free: ~37 MiB

This is slightly faster, but very close to the VRAM edge.
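Spelled out in full, this is just the 32k command with the changed flags substituted in (my reconstruction, not copied verbatim from the post):

llama-cli.exe ^
  -m "Qwen3.6-35B-A3B-MTP-IQ4_XS.gguf" ^
  -p "..." ^
  -n 512 ^
  -c 16384 ^
  --temp 0 --top-k 1 ^
  -ngl 999 -ncmoe 19 ^
  -fa on ^
  -ctk q8_0 -ctv q8_0 ^
  --no-mmap ^
  --no-jinja ^
  -t 9 ^
  --perf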

MoE offload sweep

Plain decoding, q4 KV, -t 11:

-ncmoe 22: tg128 ~41.6 t/s
-ncmoe 20: tg128 ~41.7 t/s
-ncmoe 19: tg128 ~44.2 t/s
-ncmoe 18: tg128 ~45.9 t/s
-ncmoe 17: tg128 ~46.6 t/s
-ncmoe 16: tg128 ~25.8 t/s  <-- cliff / too aggressive

So for plain decoding:

safe: -ncmoe 18
edge: -ncmoe 17
avoid: -ncmoe 16
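To reproduce this kind of sweep, a simple batch loop over -ncmoe values is enough (a sketch under the same assumptions as above, matching the q4 KV / -t 11 setup of the sweep):

:: Sweep the MoE CPU-offload count and compare tg128 across values.
:: (%%N is batch-file syntax; use %N if typing this directly at the prompt.)
for %%N in (22 20 19 18 17 16) do (
  llama-bench.exe -m "Qwen3.6-35B-A3B-MTP-IQ4_XS.gguf" -ngl 999 -ncmoe %%N -t 11 -fa 1 -ctk q4_0 -ctv q4_0
)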

KV cache sweep

At -ncmoe 18, -t 11:

q4_0 KV: pp512 ~913 t/s, tg128 ~45.8 t/s
q8_0 KV: pp512 ~915 t/s, tg128 ~45.9 t/s
q5_0 KV: much slower
mixed q8 K + q4/q5 V: much slower

So on this GPU, q8 KV is basically free and preferable:

-ctk q8_0 -ctv q8_0 
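If your llama-bench build accepts comma-separated parameter values (most parameters expand into a grid of runs), the KV comparison can be done in one command; a sketch, not the post's exact invocation:

:: Grid over K and V cache types at -ncmoe 18, -t 11; the cross-product also
:: covers the mixed q8 K + q4 V combination mentioned above.
llama-bench.exe -m "Qwen3.6-35B-A3B-MTP-IQ4_XS.gguf" -ngl 999 -ncmoe 18 -t 11 -fa 1 -ctk q4_0,q8_0 -ctv q4_0,q8_0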

MTP / speculative decoding

I also tested MTP with the llama.cpp MTP branch.

Best MTP command:

llama-cli.exe ^
  -m "Qwen3.6-35B-A3B-MTP-IQ4_XS.gguf" ^
  --spec-type mtp ^
  -p "..." ^
  -n 512 ^
  --spec-draft-n-max 2 ^
  -c 4096 ^
  --temp 0 --top-k 1 ^
  -ngl 999 -ncmoe 19 ^
  -fa on ^
  -ctk q4_0 -ctv q4_0 ^
  --no-mmap ^
  --no-jinja ^
  -t 11 ^
  --perf

Result:

Generation: ~47.7 t/s 

MTP sweep:

-ncmoe 24, depth 2: ~43.8 t/s
-ncmoe 20, depth 2: ~46.6 t/s
-ncmoe 19, depth 2: ~47.7 t/s
-ncmoe 18: failed / invalid vector subscript
-ncmoe 16: failed / invalid vector subscript

Depth 3 was worse:

depth 3, -ncmoe 20: ~39.8 t/s 

So the MTP sweet spot was:

--spec-draft-n-max 2 

Conclusion

With 12GB VRAM, plain decoding is already very strong:

Plain llama-bench: ~914 t/s pp512, ~46.8 t/s tg128
Best MTP observed: ~47.7 t/s generation

So MTP only gave about a 2% generation speedup over well-tuned plain decoding. For coding, I would personally use plain decoding with 32k context:

-c 32768 -ncmoe 20 -ctk q8_0 -ctv q8_0 -t 9 

The big lesson: for this MoE model, 12GB VRAM is a very practical sweet spot. It keeps enough experts on GPU that plain decoding becomes fast, q8 KV is usable, and 32k context is realistic.

submitted by /u/jwestra