So maybe this is a no-brainer to many experienced local LLM users, but it was not obvious to me.
I am running a 3070 (8 GB) + 64 GB DDR4. Pretty lightweight setup, so I chose the smallest Q4 unsloth model, Qwen3.6-35B-A3B-UD-IQ4_XS.gguf, which is ~18 GB. It does run OK, and with some optimizations in llama.cpp I got about 25-30 tokens/s with a 32k context window.
I did have some problems with looping during thinking, so I tried a bigger Q4 model, Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf (~23 GB). To my surprise, this is much faster! With a 128k context window, I am seeing 32 tokens/s.
I ended up using Q5_K_S for the best quality/speed balance, about 30 tokens/s, again with a 128k context window. Speed does go down with long context, but it's still over 25 tokens/s at 50k context! (Haven't tested higher yet.)
Bottom line - for MoE models like this, experiment with bigger quants than you'd expect to be able to use!
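Some back-of-envelope math on why this works: on a MoE model, each decoded token only touches the *active* experts (~3B params for an A3B model), so decode speed is roughly bounded by RAM bandwidth over that working set, not the full model size. All three numbers below are assumptions for illustration, not measurements:

```python
# Rough bandwidth math (all numbers are assumptions, not measurements):
# each decoded token reads only the active experts, so the working set
# is ~3B params regardless of whether the full file is 18 GB or 23 GB.
ACTIVE_PARAMS = 3e9      # "A3B" = ~3B active params per token
BITS_PER_WEIGHT = 4.5    # rough effective size of a Q4_K-style quant
RAM_BANDWIDTH = 50e9     # ~50 GB/s usable dual-channel DDR4 (assumption)

bytes_per_token = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8
ceiling_tps = RAM_BANDWIDTH / bytes_per_token

print(f"weights read per token: {bytes_per_token / 1e9:.2f} GB")  # ~1.69 GB
print(f"bandwidth-bound ceiling: {ceiling_tps:.0f} tokens/s")     # ~30 tokens/s
```

That ceiling is in the same ballpark as the observed 25-32 tokens/s, and it barely moves when you step up a quant size, since the active-weight read per token grows only slightly. That's why a bigger quant on a MoE doesn't cost nearly as much speed as it would on a dense model.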