I have a personal eval harness: a repo of around 30k lines of code seeded with 37 intentional issues for LLMs to debug and fix through an agentic setup (I use OpenCode).

A subset of the harness also has the LLM extract key information from reasonably large PDFs (40-60 pages), then summarize and evaluate its findings.

Long story short, the harness tests the following LLM attributes:

- Agentic capabilities
- Coding
- Image-to-text synthesis
- Instruction following
- Reasoning
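For context, the scoring boils down to a pytest diff before and after the agent run. Here's a minimal sketch of the idea, not the actual harness code (`harness_repo/` is illustrative):

```python
import subprocess

def failing_tests(repo: str) -> set[str]:
    """Run pytest quietly and collect the node IDs of failing tests."""
    result = subprocess.run(
        ["pytest", "-q", "-rf", "--tb=no", repo],
        capture_output=True, text=True,
    )
    # Short-summary lines look like: "FAILED tests/test_x.py::test_y - ..."
    return {
        line.split()[1]
        for line in result.stdout.splitlines()
        if line.startswith("FAILED")
    }

baseline = failing_tests("harness_repo/")   # the 37 seeded failures
# ... agent run happens here (OpenCode drives the model) ...
after = failing_tests("harness_repo/")

fixed = baseline - after        # seeded issues the model resolved
regressed = after - baseline    # previously passing tests it broke
print(f"fixed={len(fixed)}  regressed={len(regressed)}  "
      f"net={len(fixed) - len(regressed)}")
```

Net score = fixed − regressed, so breaking previously passing tests costs a model points.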
Both models ran at UD-Q4_K_XL for a fair baseline, with optimal sampling params for each. Gemma 4's GGUF was tested after Google's latest chat-template fixes, with the -cram and -ctkcp flags to mitigate DRAM blowups.
Here's how it went:
```
                       Qwen3.6          Gemma 4
                   ┌──────────────┐ ┌──────────────┐
Tests Fixed        │    32 / 37   │ │    28 / 37   │
Regressions        │       0      │ │       8      │
Net Score          │      32      │ │      20      │
Post-Run Failures  │       5      │ │      17      │
Duration           │    49 min    │ │    85 min    │
                   └──────────────┘ └──────────────┘
                       WINNER ✓
```
1. Test Results
| Metric | Qwen3.6-35B-A3B | Gemma 4-26B-A4B |
|---|---|---|
| Baseline failures | 37 | 37 |
| Tests fixed | 32 (86.5%) | 28 (75.7%) |
| Regressions | 0 | 8 |
| Net score (fixed − regressed) | 32 | 20 |
| Still failing (of original 37) | 5 | 9 |
| Post-run total failures | 5 | 17 |
| Guardrail violations | 0 | 0 |
Qwen actually identified the 5 leftover failures but decided they were out of scope and intentionally skipped them. Gemma just gave up after multiple retries.
2. Token Usage
| Metric | Qwen3.6 | Gemma 4 | Ratio |
|---|---|---|---|
| Input tokens | 634,965 | 1,005,964 | Gemma 1.6x more |
| Output tokens | 39,476 | 89,750 | Gemma 2.3x more |
| Grand total (I+O) | 674,441 | 1,095,714 | Gemma 1.6x more |
| Cache read tokens | 4,241,502 | 3,530,520 | Qwen 1.2x more |
| Output/Input ratio | 1:16 | 1:11 | Gemma more verbose |
| Tokens per fix | ~21K | ~39K | Gemma 1.9x more expensive |
| Tokens per net score point | ~21K | ~55K | Gemma 2.6x more expensive |
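The per-fix costs are straight arithmetic on the totals above; a quick sanity check:

```python
# Derived cost metrics, recomputed from the raw totals in the table above.
runs = {
    "Qwen3.6": {"total_tokens": 674_441, "fixed": 32, "net": 32},
    "Gemma 4": {"total_tokens": 1_095_714, "fixed": 28, "net": 20},
}

for name, r in runs.items():
    per_fix = r["total_tokens"] / r["fixed"]
    per_net = r["total_tokens"] / r["net"]
    print(f"{name}: ~{per_fix / 1000:.0f}K tokens/fix, "
          f"~{per_net / 1000:.0f}K tokens/net point")
# Qwen3.6: ~21K tokens/fix, ~21K tokens/net point
# Gemma 4: ~39K tokens/fix, ~55K tokens/net point
```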
3. Tool Calls
| Tool | Qwen3.6 | Gemma 4 |
|---|---|---|
| read | 46 | 39 |
| bash | 33 | 30 |
| edit | 14 | 13 |
| grep | 16 | 10 |
| todowrite | 4 | 3 |
| glob | 1 | 1 |
| write | 1 | 0 |
| Total | 115 | 96 |
| Successful | 115 (100%) | 96 (100%) |
| Failed | 0 | 0 |

| Derived Metric | Qwen3.6 | Gemma 4 |
|---|---|---|
| Unique files read | 18 | 27 |
| Unique files edited | 7 | 13 |
| Reads per unique file | 2.6 | 1.4 |
| Tool calls per minute | 2.3 | 1.1 |
| Edits per fix | 0.44 | 0.46 |
| Bash (pytest) runs | 33 | 30 |
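Same deal for the derived tool metrics, computed from the raw counts and wall-clock times:

```python
# Derived tool metrics, recomputed from the raw counts above.
runs = {
    "Qwen3.6": {"reads": 46, "unique_read": 18, "edits": 14,
                "calls": 115, "wall_s": 2950, "fixed": 32},
    "Gemma 4": {"reads": 39, "unique_read": 27, "edits": 13,
                "calls": 96, "wall_s": 5129, "fixed": 28},
}
for name, r in runs.items():
    print(name,
          f"reads/file={r['reads'] / r['unique_read']:.1f}",
          f"calls/min={r['calls'] / (r['wall_s'] / 60):.1f}",
          f"edits/fix={r['edits'] / r['fixed']:.2f}")
# Qwen3.6 reads/file=2.6 calls/min=2.3 edits/fix=0.44
# Gemma 4 reads/file=1.4 calls/min=1.1 edits/fix=0.46
```

The reads-per-unique-file gap is interesting: Qwen re-reads the same files to confirm context, while Gemma spreads reads (and edits) across more files, which lines up with its 8 regressions.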
4. Timing & Efficiency
| Metric | Qwen3.6 | Gemma 4 | Ratio |
|---|---|---|---|
| Wall clock | 2,950s (49m) | 5,129s (85m) | Gemma 1.74x slower |
| Total steps | 120 | 104 | — |
| Avg step duration | 10.0s | 21.7s | Gemma 2.2x slower/step |
Key Observations:
- Both models demonstrate a noticeable leap in agentic capabilities: 95+ tool calls each with 0 failures.
- Qwen is the better coder (at least in Python, which my harness is based on).
- Both models start with identical inference performance, but Gemma 4's prefill speeds fluctuate as context grows. Qwen's architecture helps it maintain similar prefill speeds throughout. Huge for agentic coding!
- A lot of people, myself included, complain about Qwen being overly verbose with its reasoning, wasting an insane number of tokens. But to my surprise, it's far more efficient in an agentic environment, drastically outperforming Gemma 4 in this regard: it fixed more issues in less time while consuming fewer tokens.
- Image-to-text synthesis is a different story: Qwen produces 8x more tokens (and takes 8x the time) compared to Gemma, but returns more accurate results. Gemma misinterpreted a few details like numerical extractions, which Qwen did not, but did reasonably well overall. Quality vs. efficiency: pick your poison.
- For summarizing and evaluating long PDFs based on instructions, both models are good enough; it comes down to preference. Gemma gets it done quickly here again. Qwen thinks a lot more and does slightly better on the final evaluation. (A stripped-down sketch of this PDF leg follows below.)
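For anyone curious, the PDF leg is conceptually just page-render → vision prompt. A simplified sketch, not the actual harness plumbing, assuming PyMuPDF for rendering and an OpenAI-compatible local endpoint:

```python
import base64
import fitz  # PyMuPDF, renders PDF pages to images
from openai import OpenAI

# Local llama.cpp-style server exposing an OpenAI-compatible API (illustrative URL).
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

def page_images(pdf_path: str) -> list[str]:
    """Render each PDF page to a base64-encoded PNG data URL."""
    doc = fitz.open(pdf_path)
    urls = []
    for page in doc:
        png = page.get_pixmap(dpi=150).tobytes("png")
        urls.append("data:image/png;base64," + base64.b64encode(png).decode())
    return urls

def extract(pdf_path: str, instructions: str) -> str:
    """Send the rendered pages plus extraction instructions to the model."""
    content = [{"type": "text", "text": instructions}]
    content += [{"type": "image_url", "image_url": {"url": u}}
                for u in page_images(pdf_path)]
    resp = client.chat.completions.create(
        model="local", messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content
```

In practice you'd batch the pages into chunks rather than firing all 40-60 at once, to stay inside the context window.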
Qwen3.6 35B A3B dominates Gemma 4 26B for my use case and has become my new daily driver, striking the best balance of speed and performance.
On the flip side, here are a few pointers in Gemma's favour:

- The Qwen 3.5/3.6 series of models has been incredibly resilient to quantization, but I'm not sure Gemma is. A full-weight comparison could look drastically different.
- Gemma's support is far less mature than Qwen's.
- Single-run variance could have impacted Gemma negatively. However, I believe the evaluation criteria across the diverse categories of my harness do a decent job of mitigating it.

At the end of the day, this is just my personal test verdict.