Hi all,
I've been reading here for over two years and finally have a question I can't find an answer to.
Qwen 3.5 27B and Gemma 4 31B are the latest examples of dense models performing noticeably better in general tasks that require high precision, where vast knowledge isn't the top priority. So I wonder what specifically made Qwen (as the only known developer of coding-specific models) pick their 30B MoE, and the subsequent super-sparse 80B A3B MoE, as the suitable architecture to fine-tune into a coding model? What are these models using the experts for? I certainly don't think each expert handles its own language/syntax...
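For context, my rough mental model of what an MoE layer does is just per-token top-k routing, something like this toy numpy sketch (all sizes are made up, and this is obviously not Qwen's actual code), which is why I doubt the "one expert per language" idea:

```python
# Toy sketch of per-token top-k expert routing (my own mental model, hypothetical sizes).
import numpy as np

rng = np.random.default_rng(0)

d_model   = 64   # hidden size (made up)
n_experts = 8    # experts in this layer
top_k     = 2    # experts activated per token

# router: learned linear map from hidden state -> expert logits
router_w  = rng.standard_normal((d_model, n_experts)) / np.sqrt(d_model)
# each "expert" here is just a tiny 2-layer MLP
expert_w1 = rng.standard_normal((n_experts, d_model, 4 * d_model)) / np.sqrt(d_model)
expert_w2 = rng.standard_normal((n_experts, 4 * d_model, d_model)) / np.sqrt(4 * d_model)

def moe_layer(x):
    """x: (tokens, d_model) -> (tokens, d_model). Routing is per token,
    so any expert can fire for any language or syntax."""
    logits = x @ router_w                                   # (tokens, n_experts)
    probs  = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)

    out = np.zeros_like(x)
    top = np.argsort(-probs, axis=-1)[:, :top_k]            # chosen experts per token
    for t in range(x.shape[0]):
        weights = probs[t, top[t]]
        weights /= weights.sum()                            # renormalise over the top-k
        for e, w in zip(top[t], weights):
            h = np.maximum(x[t] @ expert_w1[e], 0)          # ReLU stand-in for the real activation
            out[t] += w * (h @ expert_w2[e])
    return out

tokens = rng.standard_normal((5, d_model))
print(moe_layer(tokens).shape)  # (5, 64) -- only top_k of n_experts ran for each token
```

The point being: only a few experts run per token, so you get big-model capacity at small-model compute, but the experts themselves seem to specialise in ways that don't map cleanly onto anything human-interpretable.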
Why did they not proceed with the 27B, for example? Or even a 9B dense?
I can only assume it has to do with inference speed; both prompt processing (PP) and token generation (TG) are certainly much slower on the dense models. That makes me even sadder that they didn't release a 14B successor, something that could run quantised on 16GB of VRAM with ample room for context.
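My napkin math on why a 14B looks so attractive for a 16GB card (assuming roughly 4.5 bits per weight for a Q4-style quant; real runtime overhead and KV cache costs vary):

```python
# Rough weight-memory estimate for a hypothetical quantised 14B dense model.
params_b        = 14    # billions of parameters
bits_per_weight = 4.5   # assumed average for a Q4_K-style quant

weights_gib = params_b * 1e9 * bits_per_weight / 8 / 2**30
print(f"weights: ~{weights_gib:.1f} GiB")  # ~7.3 GiB, leaving ~8 GiB of a 16 GiB card for KV cache etc.
```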
Any insight would be highly appreciated.