Qwen 3.6 35B crushes Gemma 4 26B on my tests

Reddit r/LocalLLaMA / 4/18/2026


Key Points

  • A personal evaluation harness with 37 intentional LLM issues (plus an agentic setup via OpenCode) found Qwen 3.6 35B fixed 32 of the 37 issues versus Gemma 4 26B fixing 28, with no regressions for Qwen.
  • Qwen 3.6 achieved a higher net score (32 vs. 20) and ended with fewer post-run failures (5 vs. 17), indicating stronger overall problem-solving and instruction-following in this test.
  • The author notes Qwen identified the remaining 5 failures but intentionally skipped them as out-of-scope, while Gemma repeatedly retried and still ended with more unresolved failures.
  • Token-efficiency favored Qwen: Gemma consumed about 1.6× more total tokens (and produced 2.3× more output tokens), making Gemma roughly 2.6× more expensive per net score point in this setup.
  • The test also assessed agentic capabilities, coding, instruction following, reasoning, and image-to-text/PDF information extraction, suggesting Qwen’s lead spans multiple capability dimensions rather than a single metric.

I have a personal eval harness: a repo with around 30k lines of code containing 37 intentional issues for LLMs to debug and fix through an agentic setup (I use OpenCode)

A subset of the harness also has the LLM extract key information from reasonably large PDFs (40-60 pages), summarize and evaluate its findings.

Long story short, the harness tests the following LLM attributes:

  • Agentic capabilities
  • Coding
  • Image-to-text synthesis
  • Instruction following
  • Reasoning
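To make the setup concrete, here is a toy illustration of what one "intentional issue" in such a harness could look like: a seeded bug plus the pytest that flags it. All names here (buggy_window_sum, fixed_window_sum) are hypothetical; the actual harness is ~30k lines and not shown in the post.

```python
def buggy_window_sum(xs, k):
    """Seeded bug: sums one element short of the intended window."""
    return sum(xs[: k - 1])  # intentional off-by-one for the model to find

def fixed_window_sum(xs, k):
    """What the agent's edit should produce."""
    return sum(xs[:k])

def test_window_sum():
    # The harness counts this test as "fixed" once it passes, then re-runs
    # the whole suite so any newly broken test counts as a regression.
    assert fixed_window_sum([1, 2, 3, 4], 3) == 6
```

The agent loop (read → edit → run pytest) maps onto the tool-call counts reported further down.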

Both models were run at UD-Q4_K_XL for a fair baseline, with optimal sampling params. Gemma 4's GGUF was used after Google's latest chat-template fixes, with the -cram and -ctkcp flags to mitigate DRAM blowups.

Here's how it went:

                    Qwen3.6    Gemma 4
Tests Fixed         32 / 37    28 / 37
Regressions         0          8
Net Score           32         20
Post-Run Failures   5          17
Duration            49 min     85 min

WINNER: Qwen3.6 ✓


1. Test Results

Metric                          Qwen3.6-35B-A3B   Gemma 4-26B-A4B
Baseline failures               37                37
Tests fixed                     32 (86.5%)        28 (75.7%)
Regressions                     0                 8
Net score (fixed − regressed)   32                20
Still failing (of original 37)  5                 9
Post-run total failures         5                 17
Guardrail violations            0                 0
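The scoring above is simple arithmetic; a quick sketch reproducing it from the table's numbers (net score = tests fixed minus regressions introduced):

```python
# Numbers taken directly from the results table above.
BASELINE = 37
results = {
    "Qwen3.6-35B-A3B": {"fixed": 32, "regressions": 0},
    "Gemma 4-26B-A4B": {"fixed": 28, "regressions": 8},
}

for name, r in results.items():
    net = r["fixed"] - r["regressions"]  # regressions cancel out fixes
    pct = r["fixed"] / BASELINE
    print(f"{name}: fixed {r['fixed']}/{BASELINE} ({pct:.1%}), net score {net}")
```

This is why Gemma's 28 fixes collapse to a net 20: each of its 8 regressions erases a fix.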

Qwen actually identified the 5 leftover failures but decided they were out of scope and intentionally skipped them. Gemma simply gave up after multiple retries.


2. Token Usage

Metric                       Qwen3.6     Gemma 4     Ratio
Input tokens                 634,965     1,005,964   Gemma 1.6x more
Output tokens                39,476      89,750      Gemma 2.3x more
Grand total (I+O)            674,441     1,095,714   Gemma 1.6x more
Cache read tokens            4,241,502   3,530,520   Qwen 1.2x more
Output/Input ratio           1:16        1:11        Gemma more verbose
Tokens per fix               ~21K        ~39K        Gemma 1.9x more expensive
Tokens per net score point   ~21K        ~55K        Gemma 2.6x more expensive
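The two efficiency rows are derived from the totals above; a small sketch showing the derivation (the 2.6x figure comes from dividing each model's total tokens by its net score):

```python
# Totals and scores copied from the token-usage and results tables.
totals = {"Qwen3.6": 674_441, "Gemma 4": 1_095_714}  # input + output tokens
fixed  = {"Qwen3.6": 32, "Gemma 4": 28}
net    = {"Qwen3.6": 32, "Gemma 4": 20}

for m in totals:
    per_fix = totals[m] / fixed[m]   # ~21K vs ~39K
    per_net = totals[m] / net[m]     # ~21K vs ~55K
    print(f"{m}: ~{per_fix/1000:.0f}K tokens/fix, ~{per_net/1000:.0f}K tokens/net point")

cost_ratio = (totals["Gemma 4"] / net["Gemma 4"]) / (totals["Qwen3.6"] / net["Qwen3.6"])
print(f"Gemma ~{cost_ratio:.1f}x more expensive per net score point")
```

Note how the regressions amplify Gemma's cost: per fix it is only 1.9x more expensive, but per net score point the gap widens to 2.6x.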

3. Tool Calls

Tool         Qwen3.6      Gemma 4
read         46           39
bash         33           30
edit         14           13
grep         16           10
todowrite    4            3
glob         1            1
write        1            0
Total        115          96
Successful   115 (100%)   96 (100%)
Failed       0            0
Derived Metric          Qwen3.6   Gemma 4
Unique files read       18        27
Unique files edited     7         13
Reads per unique file   2.6       1.4
Tool calls per minute   2.3       1.1
Edits per fix           0.44      0.46
Bash (pytest) runs      33        30
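The derived metrics fall straight out of the raw tool counts; a minimal sketch of the derivation, using the numbers from the two tables above:

```python
# Raw counts from the tool-call table plus durations/fixes from the other sections.
stats = {
    "Qwen3.6": {"calls": 115, "minutes": 49, "reads": 46, "files_read": 18,
                "edits": 14, "fixed": 32},
    "Gemma 4": {"calls": 96, "minutes": 85, "reads": 39, "files_read": 27,
                "edits": 13, "fixed": 28},
}

for name, s in stats.items():
    print(name,
          round(s["calls"] / s["minutes"], 1),     # tool calls per minute
          round(s["reads"] / s["files_read"], 1),  # reads per unique file
          round(s["edits"] / s["fixed"], 2))       # edits per fix
```

Interesting pattern: Qwen re-reads fewer files more often (2.6 reads per file over 18 files), while Gemma spreads reads and edits across more files (27 read, 13 edited) at half the tool-call rate.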

4. Timing & Efficiency

Metric              Qwen3.6        Gemma 4        Ratio
Wall clock          2,950s (49m)   5,129s (85m)   Gemma 1.74x slower
Total steps         120            104
Avg step duration   10.0s          21.7s          Gemma 2.2x slower/step

Key Observations:

  • Both models demonstrate a noticeable leap in agentic capabilities: 95+ tool calls each, with 0 failures.
  • Qwen is the better coder (at least in Python, which my harness is based on).
  • Both models start with identical inference performance, but Gemma 4's prefill speeds fluctuate as context grows. Qwen's architecture helps it maintain similar prefill speeds throughout. Huge for agentic coding!
  • A lot of people, myself included, complain about Qwen being overly verbose with its reasoning and wasting an insane number of tokens. To my surprise, it's far more efficient in an agentic environment, drastically outperforming Gemma 4 in this regard: it fixed more issues in less time while consuming fewer tokens.
  • Image-to-text synthesis is a different story: Qwen produces 8x more tokens (and takes 8x more time) than Gemma but returns more accurate results. Gemma misinterpreted a few details, like numerical extractions, that Qwen did not, but did reasonably well overall. Quality vs. efficiency: pick your poison.
  • For summarizing and evaluating long PDFs based on instructions, both models are good enough; it comes down to preference. Gemma gets it done quickly here again. Qwen thinks a lot more and does slightly better on the final evaluation.

Qwen 3.6 35B A3B dominates Gemma 4 26B for my use case and has become my new daily driver, striking the best balance of speed and performance.

On the flip side, here are a few pointers in Gemma's favour:

  • The Qwen 3.5/3.6 series has been incredibly resilient to quantization, but I'm not sure Gemma is. A full-weight comparison could be drastically different.
  • Gemma's support is far less mature than Qwen's, which could explain some of the gap.
  • Single-run variance could have impacted Gemma negatively. However, I believe the evaluation criteria across the diverse categories of my harness do a decent job of mitigating it.

At the end of the day, this is just my personal test verdict.

submitted by /u/Lowkey_LokiSN