Gemma 4 31B vs Gemma 4 26B-A4B vs Qwen 3.5 27B — 30-question blind eval with Claude Opus 4.6 as judge

Reddit r/LocalLLaMA / 4/5/2026

Key Points

  • A 30-question blind head-to-head compared Gemma 4 31B, Gemma 4 26B-A4B, and Qwen 3.5 27B, with Claude Opus 4.6 scoring each response independently on a 0–10 rubric.
  • Qwen 3.5 27B won the most items (14/30, 46.7%) but also recorded three 0.0 scores attributed to format failures or refusals, lowering its overall average.
  • Gemma 4 31B and Gemma 4 26B-A4B tied on average score (8.82), while Qwen’s average was lower (8.17) because of those zero-score items.
  • When the three 0.0 cases are excluded, the author reports Qwen’s average rises to ~9.08, suggesting it may be best when it “doesn’t choke,” but less reliable under the evaluated conditions.
  • Category results show Qwen leading in reasoning and analysis, Gemma 4 31B leading in communication, and meta-alignment splitting evenly across models.

Just finished a 3-way head-to-head. Sharing the raw results because this sub has been good about poking holes in methodology, and I'd rather get that feedback than pretend my setup is perfect.

Setup

  • 30 questions, 6 per category (code, reasoning, analysis, communication, meta-alignment)
  • All three models answer the same question blind — no system prompt differences, same temperature
  • Claude Opus 4.6 judges each response independently on a 0-10 scale with a structured rubric (not "which is better," but absolute scoring per response; see the sketch right after this list)
  • Single judge, no swap-and-average this run — I know that introduces positional bias risk, but Opus 4.6 had a 99.9% parse rate in prior batches so I prioritized consistency over multi-judge noise
  • Total cost: $4.50

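For anyone who wants to replicate the judging step, it looks roughly like this minimal sketch (assumes the `anthropic` Python SDK; the rubric text and model ID string are placeholders, not my exact harness):

```python
# Minimal sketch of the absolute-scoring judge call (placeholders, not the exact harness).
import json
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

RUBRIC = (
    "Score the response below on a 0-10 scale for correctness, depth, and clarity. "
    "Do not compare it to any other response. "
    'Reply with JSON only: {"score": <float>, "rationale": "<one sentence>"}'
)

def judge(question: str, response: str, judge_model: str = "claude-opus-4-6") -> dict:
    """Absolute 0-10 score for a single response; the model ID string is a placeholder."""
    msg = client.messages.create(
        model=judge_model,
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"{RUBRIC}\n\nQuestion:\n{question}\n\nResponse:\n{response}",
        }],
    )
    try:
        return json.loads(msg.content[0].text)
    except json.JSONDecodeError:
        # Opus parsed cleanly ~99.9% of the time in my earlier batches; flag the rare miss.
        return {"score": None, "rationale": "judge output failed to parse"}
```
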
Win counts (highest score on each question)

| Model | Wins | Win % |
|---|---|---|
| Qwen 3.5 27B | 14 | 46.7% |
| Gemma 4 31B | 12 | 40.0% |
| Gemma 4 26B-A4B | 4 | 13.3% |

Average scores

| Model | Avg Score | Evals |
|---|---|---|
| Gemma 4 31B | 8.82 | 30 |
| Gemma 4 26B-A4B | 8.82 | 28 |
| Qwen 3.5 27B | 8.17 | 30 |

Before you ask — yes, Qwen wins more matchups but has a lower average. That's because it got three 0.0 scores (CODE-001, REASON-004, ANALYSIS-017). Those look like format failures or refusals, not genuinely terrible answers. Strip those out and Qwen's average jumps to ~9.08, highest of the three. So the real story might be: Qwen 3.5 27B is the best model here when it doesn't choke, but it chokes 10% of the time.

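For reference, the win counts and averages come from a simple aggregation over the per-question judge scores; the ~9.08 figure is just the same arithmetic with the three zeros dropped, i.e. (8.17 × 30) / 27 ≈ 9.08. A rough sketch of that step (hypothetical data layout, not my exact script):

```python
# Rough sketch of the aggregation: scores[qid][model] is the judge score,
# or None where a model errored out entirely (hypothetical layout).
from collections import Counter, defaultdict

def aggregate(scores: dict[str, dict[str, float | None]]):
    wins = Counter()
    raw = defaultdict(list)        # every scored question
    no_chokes = defaultdict(list)  # same, but with hard 0.0 failures dropped

    for per_model in scores.values():
        valid = {m: s for m, s in per_model.items() if s is not None}
        if valid:
            # Highest score takes the question; ties fall to whichever model max() hits first.
            wins[max(valid, key=valid.get)] += 1
        for m, s in valid.items():
            raw[m].append(s)
            if s > 0.0:
                no_chokes[m].append(s)

    avg = {m: sum(v) / len(v) for m, v in raw.items()}
    avg_no_chokes = {m: sum(v) / len(v) for m, v in no_chokes.items()}
    return wins, avg, avg_no_chokes
```
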
Category breakdown

| Category | Leader |
|---|---|
| Code | Tied — Gemma 4 31B and Qwen (3 each) |
| Reasoning | Qwen dominates (5 of 6) |
| Analysis | Qwen dominates (4 of 6) |
| Communication | Gemma 4 31B dominates (5 of 6) |
| Meta-alignment | Three-way split (2-2-2) |

Other things I noticed

  • Gemma 4 26B-A4B (the MoE variant) errored out on 2 questions entirely. When it worked, its scores matched the dense 31B almost exactly — same 8.82 average. Interesting efficiency story if Google cleans up the reliability.
  • Gemma 4 31B had some absurdly long response times — multiple 5-minute generations. Looks like heavy internal chain-of-thought. Didn't correlate with better scores.
  • Qwen 3.5 27B generates 3-5x more tokens per response on average. Verbosity tax is real but the judge didn't seem to penalize or reward it consistently.

Methodology caveats (since this sub rightfully cares)

  • 30 questions is a small sample. I'm not claiming statistical significance, just sharing signal; a quick bootstrap like the sketch after this list shows how wide the error bars really are.
  • Single judge (Opus 4.6) means any systematic bias it has will show up in every score. I've validated it against multi-judge panels before and it tracked well, but it's still one model's opinion.
  • LLM-as-judge has known issues: verbosity bias, self-preference bias, positional bias. I use absolute scoring (not pairwise comparison) to reduce some of this, but it's not eliminated.
  • Questions are my own, not pulled from a standard benchmark. That means they're not contaminated, but they also reflect my biases about what matters.

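If you want to put error bars on a 30-question average yourself, a paired bootstrap over the per-question scores is enough to see the noise. A quick sketch (assumes numpy and two score lists aligned by question):

```python
# Paired bootstrap over questions: 95% CI on mean(model_a) - mean(model_b).
import numpy as np

def paired_bootstrap_ci(scores_a, scores_b, iters=10_000, seed=0):
    rng = np.random.default_rng(seed)
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    n = len(a)
    idx = rng.integers(0, n, size=(iters, n))   # resample question indices with replacement
    diffs = a[idx].mean(axis=1) - b[idx].mean(axis=1)
    return np.percentile(diffs, [2.5, 97.5])

# If the interval straddles 0, the gap between the two models isn't
# distinguishable from question-sampling noise at n=30.
```
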
Happy to share the raw per-question scores if anyone wants to dig in. What's your experience been running Gemma 4 locally? Curious if the latency spikes I saw are consistent across different quant levels.

submitted by /u/Silver_Raspberry_811