Nothing extensive to see here, just a quick qualitative and performance comparison for a single programming use case: making an ancient website that uses Flash for everything work in modern browsers. I let all three models tackle exactly the same issue and gave them exactly the same multi-turn feedback.
- Gemma 4 and Qwen 3.6 both nailed the first issue in a functionally equivalent way and provided useful additional feedback.
- Q3CN (Qwen3-Coder-Next) went for a more convoluted fix.
- All three missed a remaining breaking issue after the proposed fix.
- Gemma 4 then made a simple, spot-on fix.
- Qwen 3.6 also pointed the issue out, though less cleanly, and then solved it in a rather convoluted way that suggested it understood the problem less well than Gemma 4 did.
- Q3CN proposed a very convoluted fix that missed the actual issue.
Note that all models were prompted directly via the completions API, outside of an agentic harness. This put Q3CN at a disadvantage, since it is a non-reasoning model and wasn't prompted for even basic CoT.
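For context, "prompted directly via the completions API" here means something like the following minimal Python sketch against a local llama-server instance, using llama.cpp's native /completion endpoint. The URL, prompt contents, and n_predict cap are placeholders, not the exact values used:

```python
# Minimal sketch of a raw completions call to a local llama-server.
# Endpoint and fields follow llama.cpp's native /completion API;
# the URL, prompt text, and n_predict cap are placeholders.
import requests

def complete(prompt: str) -> str:
    resp = requests.post(
        "http://localhost:8080/completion",  # default llama-server port
        json={
            "prompt": prompt,   # full conversation so far, as raw text
            "temperature": 0,   # matches the --temp 0 setting below
            "n_predict": 8192,  # generous cap for verbose models
        },
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["content"]

# Multi-turn feedback is just the previous transcript plus the new turn,
# re-sent as one growing prompt (no agentic harness, no injected CoT).
transcript = "<task description and Flash-site source files here>"
answer = complete(transcript)
transcript += answer + "\n<follow-up feedback here>"
```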
| Metric | gemma-4-31B-it-UD-Q4_K_XL (18.8 GB) | Qwen3.6-35B-A3B-UD-Q5_K_XL (26.6 GB) | Qwen3-Coder-Next-UD-Q4_K_XL (49.6 GB) |
|---|---|---|---|
| Initial prompt tokens | 60178 | 53063 | 50288 |
| Prompt speed (tps) | 642 | 2130 | 801 |
| Total prompt time (s) | 93 | 25 | 64 |
| Generated tokens | 1938 | 5437 | 1076 |
| Response speed (tps) | 13 | 66 | 40 |
| Total response time (s) | 151 | 82 | 27 |
| **Next turn** | | | |
| Generated tokens | 4854 | 12027 | 1195 |
| Response speed (tps) | 12 | 59 | 34 |
| Total response time (s) | 396 | 204 | 35 |
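As a quick sanity check on the columns, each total-time row is simply tokens divided by the corresponding tps figure; for example, with the Qwen 3.6 first-turn numbers from the table:

```python
# Cross-checking the table: total time = tokens / tokens-per-second.
# Values below are the Qwen 3.6 first-turn numbers from the table above.
prompt_time = 53063 / 2130   # ~24.9 s, table reports 25 s
response_time = 5437 / 66    # ~82.4 s, table reports 82 s
print(round(prompt_time), round(response_time))  # 25 82
```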
Some observations:
- Qwen 3.6 is the most verbose, including in its reasoning, but it still finishes faster than Gemma 4 thanks to its much higher TPS.
- Qwen 3.6 clearly wins the prompt-processing category.
- Q3CN is the fastest overall despite being much larger, thanks to far lower verbosity; skipping reasoning saves tokens but seems to cost capability.
- In an agentic setting outside this test, I found that Gemma 4 deals noticeably better with complex and conflicting information in coding and debugging scenarios. That might be a dense-vs-MoE effect.
All tests ran on the latest llama.cpp with 24 GB of VRAM and partial offload (chosen by automated fitting), using these options: `-fa on --temp 0 -np 1 -c 80000 -ctv q8_0 -ctk q8_0 -b 2048 -ub 2048`
(Yes, I'm aware that temp 0 isn't recommended, but it currently works nicely for me.)
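For anyone wanting to reproduce the setup, here is a hedged sketch of that launch wrapped in Python. The model path is a placeholder, and the automated VRAM fitting (i.e. the choice of `-ngl`) is not reproduced:

```python
# Sketch of launching llama-server with the options listed above.
# The model path is a placeholder; use whichever GGUF you are testing.
# Partial offload (the -ngl value) is left out, since the post used an
# automated fitting step that isn't reproduced here.
import subprocess

MODEL = "models/Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf"  # placeholder path

subprocess.run([
    "llama-server", "-m", MODEL,
    "-fa", "on",     # flash attention
    "--temp", "0",   # deterministic-ish sampling, as used in the test
    "-np", "1",      # a single parallel slot
    "-c", "80000",   # 80k-token context window
    "-ctk", "q8_0",  # quantized KV cache, keys
    "-ctv", "q8_0",  # quantized KV cache, values
    "-b", "2048",    # logical batch size
    "-ub", "2048",   # physical (micro) batch size
], check=True)
```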