Small Gemma 4, Qwen 3.6 and Qwen 3 Coder Next comparison for a debugging use-case

Reddit r/LocalLLaMA / 4/19/2026

💬 Opinion · Tools & Practical Usage · Models & Research

Key Points

  • The post compares three LLMs (Gemma 4, Qwen 3.6, and Qwen 3 Coder Next) using the same multi-turn debugging task: updating a legacy Flash-based website to work in modern browsers.
  • Gemma 4 and Qwen 3.6 both solved the initial problem in functionally similar ways and then provided useful follow-up feedback, while Qwen 3 Coder Next produced more convoluted suggestions.
  • All three models missed a remaining breaking issue after their proposed fixes, but Gemma 4 then delivered a simple, correct final fix whereas Qwen 3.6 offered a more convoluted approach and still felt less clean.
  • The author notes that the models were prompted directly via the completions API, without an agentic harness or explicit CoT prompting, a limitation that particularly affects Qwen 3 Coder Next since it is a non-reasoning model.

Nothing extensive to see here, just a quick qualitative and performance comparison for a single programming use-case: Making an ancient website that uses Flash for everything work with modern browsers. I let all 3 models tackle exactly the same issue and provided exactly the same multi-turn feedback.

  • Gemma 4 and Qwen 3.6 both nailed the first issue in a functionally equivalent way and provided useful additional feedback.
  • Q3CN went for a more convoluted fix.
  • All three missed a remaining breaking issue after the proposed fix.
  • Gemma 4 then made a simple, spot-on fix.
  • Qwen 3.6 also pointed out the issue, but solved it in a rather convoluted way that suggested it understood the problem less well than Gemma 4 did.
  • Q3CN proposed a very convoluted fix that missed the actual issue.

Note that all models were prompted directly via completions API, outside of an agentic harness. Thus Q3CN had the drawback of being a non-reasoning model and not being prompted for basic CoT.
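For illustration, a minimal sketch of what such a direct, harness-free completions call might look like, assuming llama.cpp's OpenAI-compatible `/v1/completions` endpoint on the default local port (the URL and prompt are illustrative, not from the post):

```python
import json
import urllib.request

# Assumed endpoint: llama-server's OpenAI-compatible completions API on the
# default port. Adjust host/port to match your own server invocation.
BASE_URL = "http://127.0.0.1:8080/v1/completions"

def build_request(prompt: str, max_tokens: int = 4096) -> dict:
    # --temp 0 from the post maps to temperature=0 here (greedy decoding).
    return {"prompt": prompt, "temperature": 0, "max_tokens": max_tokens}

def complete(prompt: str) -> str:
    """Send one raw completion request; no agentic loop, no injected CoT."""
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(build_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]
```

Because nothing prepends a "think step by step" instruction or tool scaffold, a non-reasoning model like Q3CN answers cold, which is the drawback the post mentions.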

| | gemma-4-31B-it-UD-Q4_K_XL (18.8 GB) | Qwen3.6-35B-A3B-UD-Q5_K_XL (26.6 GB) | Qwen3-Coder-Next-UD-Q4_K_XL (49.6 GB) |
|---|---|---|---|
| Initial prompt tokens | 60178 | 53063 | 50288 |
| Prompt speed (tps) | 642 | 2130 | 801 |
| Total prompt time (s) | 93 | 25 | 64 |
| Generated tokens | 1938 | 5437 | 1076 |
| Response speed (tps) | 13 | 66 | 40 |
| Total response time (s) | 151 | 82 | 27 |
| **Next turn** | | | |
| Generated tokens | 4854 | 12027 | 1195 |
| Response speed (tps) | 12 | 59 | 34 |
| Total response time (s) | 396 | 204 | 35 |
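As a quick sanity check, the reported throughput figures are consistent with tokens divided by wall-clock time (values copied from the table; times are rounded to whole seconds, so a few percent of drift is expected):

```python
def derived_tps(tokens: int, seconds: int) -> float:
    """Throughput implied by a token count and a wall-clock duration."""
    return tokens / seconds

# (tokens, seconds, reported tps) triples taken from the table above.
rows = [
    (60178, 93, 642), (53063, 25, 2130), (50288, 64, 801),   # prompt processing
    (1938, 151, 13), (5437, 82, 66), (1076, 27, 40),         # first response
    (4854, 396, 12), (12027, 204, 59), (1195, 35, 34),       # next turn
]

for tokens, seconds, reported in rows:
    # Every derived value lands within ~2-3% of the reported one.
    assert abs(derived_tps(tokens, seconds) - reported) / reported < 0.05
```

This also makes the verbosity point concrete: Qwen 3.6 generated roughly 3-5x as many tokens as the other two models per turn, yet still finished fastest on the second turn's prompt side.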

Some observations:

  • Qwen 3.6 is the most verbose, including in its reasoning output, but it still finishes faster than Gemma 4 thanks to a much higher TPS.
  • Qwen 3.6 clearly wins the prompt-processing category.
  • Q3CN is faster overall despite being far larger, because it is much less verbose; the absence of reasoning, however, reduces its capability.
  • In an agentic setting outside that test I found that Gemma 4 deals noticeably better with complex and conflicting information in coding and debugging scenarios. That might be due to dense vs. MoE.

All tests were run with the latest llama.cpp on 24 GB of VRAM, with partial offload due to automated fitting, and these options: `-fa on --temp 0 -np 1 -c 80000 -ctv q8_0 -ctk q8_0 -b 2048 -ub 2048`
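Put together, the setup above might look roughly like the following `llama-server` invocation (the model path is a placeholder for whichever of the three GGUFs is being tested; the remaining flags are copied verbatim from the post):

```shell
# Sketch of the test setup: flash attention on, greedy decoding, one slot,
# 80k context, 8-bit quantized KV cache, 2048 logical/physical batch sizes.
# Layer offload to the 24 GB GPU is left to llama.cpp's automatic fitting.
llama-server \
  -m ./gemma-4-31B-it-UD-Q4_K_XL.gguf \
  -fa on --temp 0 -np 1 -c 80000 \
  -ctv q8_0 -ctk q8_0 -b 2048 -ub 2048
```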

(Yes, I'm aware that temp 0 isn't recommended, yet it currently works nicely for me)

submitted by /u/Chromix_