Just finished a 3-way head-to-head. Sharing the raw results because this sub has been good about poking holes in methodology, and I'd rather get that feedback than pretend my setup is perfect.
Setup
- 30 questions, 6 per category (code, reasoning, analysis, communication, meta-alignment)
- All three models answer the same questions blind, with no system prompt differences and the same temperature
- Claude Opus 4.6 judges each response independently on a 0-10 scale with a structured rubric (absolute scoring per response, not "which is better"); a minimal sketch of the judge call is below
- Single judge, no swap-and-average this run — I know that introduces positional bias risk, but Opus 4.6 had a 99.9% parse rate in prior batches so I prioritized consistency over multi-judge noise
- Total cost: $4.50
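For anyone who wants to poke at the harness itself, here's a stripped-down sketch of a single judge call. The rubric text, model ID string, and parsing are simplified placeholders rather than my exact code, but the shape matches what's described above: one response per call, absolute 0-10 scoring, JSON out (the 99.9% parse rate is how often that JSON parses cleanly).

```python
import json
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

# Abbreviated rubric; the real one has more criteria and per-category guidance
RUBRIC = (
    "Score the RESPONSE to the QUESTION on an absolute 0-10 scale "
    "(correctness, completeness, clarity). Do not compare to other answers.\n"
    'Reply with JSON only: {"score": <number>, "rationale": "<one sentence>"}'
)

def judge(question: str, response: str) -> float:
    """Score one response in isolation: absolute scoring, no pairwise comparison."""
    msg = client.messages.create(
        model="claude-opus-4-6",  # placeholder ID, not necessarily the exact API string
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"{RUBRIC}\n\nQUESTION:\n{question}\n\nRESPONSE:\n{response}",
        }],
    )
    # A "parse failure" is this json.loads (or the float cast) blowing up
    return float(json.loads(msg.content[0].text)["score"])
```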
Win counts (highest score on each question)
| Model | Wins | Win % |
|---|---|---|
| Qwen 3.5 27B | 14 | 46.7% |
| Gemma 4 31B | 12 | 40.0% |
| Gemma 4 26B-A4B | 4 | 13.3% |
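To be explicit about how a "win" is counted: highest judge score on each question takes it. A minimal sketch of the tally, with made-up question IDs, model keys, and scores purely for illustration:

```python
from collections import Counter

# scores[question_id][model] = judge score on 0-10; values here are made up
scores = {
    "Q-01": {"qwen3.5-27b": 9.5, "gemma4-31b": 9.0, "gemma4-26b-a4b": 8.5},
    "Q-02": {"qwen3.5-27b": 0.0, "gemma4-31b": 8.5, "gemma4-26b-a4b": 9.0},
    # ... 30 entries in the real run
}

# One winner per question: the model with the top score (ties would need their own rule)
wins = Counter(max(per_model, key=per_model.get) for per_model in scores.values())
for model, n in wins.most_common():
    print(f"{model}: {n} wins ({n / len(scores):.1%})")
```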
Average scores
| Model | Avg Score | Evals |
|---|---|---|
| Gemma 4 31B | 8.82 | 30 |
| Gemma 4 26B-A4B | 8.82 | 28 |
| Qwen 3.5 27B | 8.17 | 30 |
Before you ask: yes, Qwen wins more questions but has a lower average. That's because it got three 0.0 scores (CODE-001, REASON-004, ANALYSIS-017). Those look like format failures or refusals, not genuinely terrible answers. Strip those out and Qwen's average jumps to ~9.08, the highest of the three. So the real story might be: Qwen 3.5 27B is the best model here when it doesn't choke, but it chokes 10% of the time.
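Quick sanity check on that arithmetic, reproducible from the summary numbers alone:

```python
avg_all = 8.17            # Qwen 3.5 27B average over all 30 questions
n_total, n_zeros = 30, 3  # the three 0.0 scores

avg_without_zeros = avg_all * n_total / (n_total - n_zeros)
print(round(avg_without_zeros, 2))  # 9.08 (modulo rounding in the reported 8.17)
```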
Category breakdown
| Category | Leader |
|---|---|
| Code | Tied — Gemma 4 31B and Qwen (3 each) |
| Reasoning | Qwen dominates (5 of 6) |
| Analysis | Qwen leads (4 of 6) |
| Communication | Gemma 4 31B dominates (5 of 6) |
| Meta-alignment | Three-way split (2-2-2) |
Other things I noticed
- Gemma 4 26B-A4B (the MoE variant) errored out on 2 questions entirely. When it worked, its scores matched the dense 31B almost exactly — same 8.82 average. Interesting efficiency story if Google cleans up the reliability.
- Gemma 4 31B had some absurdly long response times — multiple 5-minute generations. Looks like heavy internal chain-of-thought. Didn't correlate with better scores.
- Qwen 3.5 27B generates 3-5x more tokens per response on average. Verbosity tax is real but the judge didn't seem to penalize or reward it consistently.
Methodology caveats (since this sub rightfully cares)
- 30 questions is a small sample. I'm not claiming statistical significance, just sharing signal.
- Single judge (Opus 4.6) means any systematic bias it has will show up in every score. I've validated it against multi-judge panels before and it tracked well, but it's still one model's opinion.
- LLM-as-judge has known issues: verbosity bias, self-preference bias, positional bias. I use absolute scoring (not pairwise comparison) to reduce some of this, but it's not eliminated.
- Questions are my own, not pulled from a standard benchmark. That means they're not contaminated, but they also reflect my biases about what matters.
Happy to share the raw per-question scores if anyone wants to dig in. What's your experience been running Gemma 4 locally? Curious if the latency spikes I saw are consistent across different quant levels.