I have a personal eval harness: a repo of around 30k lines of code seeded with 37 intentional issues for LLMs to debug and fix through an agentic setup (I use OpenCode).

A subset of the harness also has the LLM extract key information from reasonably large PDFs (40-60 pages), then summarize and evaluate its findings.

Long story short, the harness tests the following LLM attributes:

- Agentic capabilities
- Coding
- Image-to-text synthesis
- Instruction following
- Reasoning
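For context, the scoring boils down to a pytest diff before and after the agent run. Here's a minimal sketch of the idea, not the actual harness code (`harness_repo/` is illustrative):

```python
import subprocess

def failing_tests(repo: str) -> set[str]:
    """Run pytest quietly and collect the node IDs of failing tests."""
    result = subprocess.run(
        ["pytest", "-q", "-rf", "--tb=no", repo],
        capture_output=True, text=True,
    )
    # Short-summary lines look like: "FAILED tests/test_x.py::test_y - ..."
    return {
        line.split()[1]
        for line in result.stdout.splitlines()
        if line.startswith("FAILED")
    }

baseline = failing_tests("harness_repo/")   # the 37 seeded failures
# ... agent run happens here (OpenCode drives the model) ...
after = failing_tests("harness_repo/")

fixed = baseline - after        # seeded issues the model resolved
regressed = after - baseline    # previously passing tests it broke
print(f"fixed={len(fixed)}  regressed={len(regressed)}  "
      f"net={len(fixed) - len(regressed)}")
```

Net score = fixed − regressed, so breaking previously passing tests costs a model points.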
Both models ran at UD-Q4_K_XL for a fair baseline, with optimal sampling params for each. Gemma 4's GGUF was tested after Google's latest chat-template fixes, with the -cram and -ctkcp flags to mitigate DRAM blowups.
Here's how it went:
```
                       Qwen3.6          Gemma 4
                   ┌──────────────┐ ┌──────────────┐
Tests Fixed        │    32 / 37   │ │    28 / 37   │
Regressions        │       0      │ │       8      │
Net Score          │      32      │ │      20      │
Post-Run Failures  │       5      │ │      17      │
Duration           │    49 min    │ │    85 min    │
                   └──────────────┘ └──────────────┘
                       WINNER ✓
```
1. Test Results
| Metric | Qwen3.6-35B-A3B | Gemma 4-26B-A4B |
|---|---|---|
| Baseline failures | 37 | 37 |
| Tests fixed | 32 (86.5%) | 28 (75.7%) |
| Regressions | 0 | 8 |
| Net score (fixed − regressed) | 32 | 20 |
| Still failing (of original 37) | 5 | 9 |
| Post-run total failures | 5 | 17 |
| Guardrail violations | 0 | 0 |
Qwen actually identified the 5 leftover failures but decided they were out of scope and intentionally skipped them. Gemma just gave up after multiple retries.
2. Token Usage
| Metric | Qwen3.6 | Gemma 4 | Ratio |
|---|---|---|---|
| Input tokens | 634,965 | 1,005,964 | Gemma 1.6x more |
| Output tokens | 39,476 | 89,750 | Gemma 2.3x more |
| Grand total (I+O) | 674,441 | 1,095,714 | Gemma 1.6x more |
| Cache read tokens | 4,241,502 | 3,530,520 | Qwen 1.2x more |
| Output/Input ratio | 1:16 | 1:11 | Gemma more verbose |
| Tokens per fix | ~21K | ~39K | Gemma 1.9x more expensive |
| Tokens per net score point | ~21K | ~55K | Gemma 2.6x more expensive |
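The per-fix costs are straight arithmetic on the totals above; a quick sanity check:

```python
# Derived cost metrics, recomputed from the raw totals in the table above.
runs = {
    "Qwen3.6": {"total_tokens": 674_441, "fixed": 32, "net": 32},
    "Gemma 4": {"total_tokens": 1_095_714, "fixed": 28, "net": 20},
}

for name, r in runs.items():
    per_fix = r["total_tokens"] / r["fixed"]
    per_net = r["total_tokens"] / r["net"]
    print(f"{name}: ~{per_fix / 1000:.0f}K tokens/fix, "
          f"~{per_net / 1000:.0f}K tokens/net point")
# Qwen3.6: ~21K tokens/fix, ~21K tokens/net point
# Gemma 4: ~39K tokens/fix, ~55K tokens/net point
```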
3. Tool Calls
| Tool | Qwen3.6 | Gemma 4 |
|---|---|---|
| read | 46 | 39 |
| bash | 33 | 30 |
| edit | 14 | 13 |
| grep | 16 | 10 |
| todowrite | 4 | 3 |
| glob | 1 | 1 |
| write | 1 | 0 |
| Total | 115 | 96 |
| Successful | 115 (100%) | 96 (100%) |
| Failed | 0 | 0 |

| Derived Metric | Qwen3.6 | Gemma 4 |
|---|---|---|
| Unique files read | 18 | 27 |
| Unique files edited | 7 | 13 |
| Reads per unique file | 2.6 | 1.4 |
| Tool calls per minute | 2.3 | 1.1 |
| Edits per fix | 0.44 | 0.46 |
| Bash (pytest) runs | 33 | 30 |
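Same deal for the derived tool metrics, computed from the raw counts and wall-clock times:

```python
# Derived tool metrics, recomputed from the raw counts above.
runs = {
    "Qwen3.6": {"reads": 46, "unique_read": 18, "edits": 14,
                "calls": 115, "wall_s": 2950, "fixed": 32},
    "Gemma 4": {"reads": 39, "unique_read": 27, "edits": 13,
                "calls": 96, "wall_s": 5129, "fixed": 28},
}
for name, r in runs.items():
    print(name,
          f"reads/file={r['reads'] / r['unique_read']:.1f}",
          f"calls/min={r['calls'] / (r['wall_s'] / 60):.1f}",
          f"edits/fix={r['edits'] / r['fixed']:.2f}")
# Qwen3.6 reads/file=2.6 calls/min=2.3 edits/fix=0.44
# Gemma 4 reads/file=1.4 calls/min=1.1 edits/fix=0.46
```

The reads-per-unique-file gap is interesting: Qwen re-reads the same files to confirm context, while Gemma spreads reads (and edits) across more files, which lines up with its 8 regressions.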
4. Timing & Efficiency
| Metric | Qwen3.6 | Gemma 4 | Ratio |
|---|---|---|---|
| Wall clock | 2,950s (49m) | 5,129s (85m) | Gemma 1.74x slower |
| Total steps | 120 | 104 | — |
| Avg step duration | 10.0s | 21.7s | Gemma 2.2x slower/step |
Key Observations:
- Both models demonstrate a noticeable leap in agentic capabilities: 95+ tool calls each with 0 failures.
- Qwen is the better coder (at least in Python, which my harness is based on).
- Both models start with identical inference performance, but Gemma 4's prefill speeds fluctuate as context grows. Qwen's architecture helps it maintain similar prefill speeds throughout. Huge for agentic coding!
- A lot of people, myself included, complain about Qwen being overly verbose with its reasoning, wasting an insane number of tokens. But to my surprise, it's far more efficient in an agentic environment, drastically outperforming Gemma 4 in this regard: it fixed more issues in less time while consuming fewer tokens.
- Image-to-text synthesis is a different story: Qwen produces 8x more tokens (and takes 8x the time) compared to Gemma, but returns more accurate results. Gemma misinterpreted a few details like numerical extractions, which Qwen did not, but did reasonably well overall. Quality vs. efficiency: pick your poison.
- For summarizing and evaluating long PDFs based on instructions, both models are good enough; it comes down to preference. Gemma gets it done quickly here again. Qwen thinks a lot more and does slightly better on the final evaluation. (A stripped-down sketch of this PDF leg follows below.)
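For anyone curious, the PDF leg is conceptually just page-render → vision prompt. A simplified sketch, not the actual harness plumbing, assuming PyMuPDF for rendering and an OpenAI-compatible local endpoint:

```python
import base64
import fitz  # PyMuPDF, renders PDF pages to images
from openai import OpenAI

# Local llama.cpp-style server exposing an OpenAI-compatible API (illustrative URL).
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

def page_images(pdf_path: str) -> list[str]:
    """Render each PDF page to a base64-encoded PNG data URL."""
    doc = fitz.open(pdf_path)
    urls = []
    for page in doc:
        png = page.get_pixmap(dpi=150).tobytes("png")
        urls.append("data:image/png;base64," + base64.b64encode(png).decode())
    return urls

def extract(pdf_path: str, instructions: str) -> str:
    """Send the rendered pages plus extraction instructions to the model."""
    content = [{"type": "text", "text": instructions}]
    content += [{"type": "image_url", "image_url": {"url": u}}
                for u in page_images(pdf_path)]
    resp = client.chat.completions.create(
        model="local", messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content
```

In practice you'd batch the pages into chunks rather than firing all 40-60 at once, to stay inside the context window.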
Qwen3.6 35B A3B dominates Gemma 4 26B for my use case and has become my new daily driver, striking the best balance of speed and performance.
On the flip side, here are a few pointers in Gemma's favour:

- The Qwen 3.5/3.6 series of models has been incredibly resilient to quantization, but I'm not sure Gemma is. A full-weight comparison could look drastically different.
- Gemma's support is far less mature than Qwen's.
- Single-run variance could have impacted Gemma negatively. However, I believe the evaluation criteria across the diverse categories of my harness do a decent job of mitigating it.

At the end of the day, this is just my personal test verdict.