Qwen3.6-35B-A3B - even in VRAM-limited scenarios it can be better to use bigger quants than you'd expect!

Reddit r/LocalLLaMA / 4/25/2026

💬 Opinion · Tools & Practical Usage · Models & Research

Key Points

  • The author runs Qwen3.6-35B-A3B on an RTX 3070 with 8 GB of VRAM (plus 64 GB of system RAM) by using the smallest Q4 quant (IQ4_XS, about 18 GB) with 32k context, achieving roughly 25–30 tokens per second.
  • They encounter looping issues during “thinking,” so they test a larger Q4 quant (Q4_K_XL, about 23 GB) and find it runs substantially faster despite the increased memory use, reaching about 32 tokens per second at 128k context.
  • They ultimately use a Q5_K_S quant as the best quality/speed trade-off, sustaining around 30 tokens per second with a 128k context window.
  • Performance decreases with longer contexts, but the system still stays above 25 tokens per second even at 50k context, leading to the practical takeaway to try bigger quants than expected for MoE models.

So maybe this is a no-brainer to many experienced local LLM users but it was not obvious for me.

I am running a 3070 8 GB + 64 GB DDR4. Pretty lightweight setup, so I chose the smallest Q4 unsloth model, Qwen3.6-35B-A3B-UD-IQ4_XS.gguf (~18 GB). It does run OK, and with some optimizations in llama.cpp I got about 25–30 tokens/s with a 32k context window.
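The post doesn't list the exact llama.cpp flags, but the standard way to fit an ~18 GB MoE GGUF next to 8 GB of VRAM is to offload all layers to the GPU and then override the large per-expert FFN tensors back to CPU RAM. A sketch of that kind of invocation (model path and values illustrative, not the author's exact command; flags are from recent llama.cpp builds):

```shell
# Sketch, not the author's exact command: -ngl 99 offloads all layers
# to the GPU, then -ot "exps=CPU" overrides the big per-expert tensors
# (names containing "exps") back into system RAM, where the 64 GB of
# DDR4 holds them. Only the small shared/attention weights stay in VRAM.
llama-server -m Qwen3.6-35B-A3B-UD-IQ4_XS.gguf \
    -c 32768 \
    -ngl 99 \
    -ot "exps=CPU"
```

Because only ~3B parameters are active per token in an A3B MoE, the CPU side of this split stays fast enough for interactive use.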

I did have some problems with looping during thinking, so I tried a bigger Q4 model, Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf (~23 GB). To my surprise, this is much faster! With a 128k context window, I am seeing 32 tokens/s. (Likely because the IQ i-quant formats decode via lookup tables that are slow on CPU, while K-quants dequantize much faster - and with this kind of split, the expert weights are running on the CPU.)

I ended up using Q5_K_S for the best quality/speed balance - about 30 tokens/s, also with a 128k context window. The speed does go down with long context, but it's still over 25 tokens/s at 50k (haven't tested higher yet).
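If you want to compare quants the same way without eyeballing chat output, llama-bench (shipped with llama.cpp) gives repeatable tokens/s numbers. A sketch, with an illustrative model path; the -ot flag is available in recent builds:

```shell
# Sketch: benchmark pure generation speed for a candidate quant with
# the same expert-offload split. -p 0 skips the prompt-processing test,
# -n 128 times generation of 128 tokens. Swap in each GGUF to compare.
llama-bench -m Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
    -ngl 99 -ot "exps=CPU" \
    -p 0 -n 128
```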

Bottom line - for MoE models like this, experiment with bigger quants than you'd expect to be able to use!

submitted by /u/jeremynsl