Just finished a 3-way head-to-head. Sharing the raw results because this sub has been good about poking holes in methodology, and I'd rather get that feedback than pretend my setup is perfect.
Setup
- 30 questions, 6 per category (code, reasoning, analysis, communication, meta-alignment)
- All three models answer the same questions blind, with no system prompt differences and the same temperature
- Claude Opus 4.6 judges each response independently on a 0-10 scale with a structured rubric (absolute scoring per response, not "which is better"); a minimal sketch of the judge call is below
- Single judge, no swap-and-average this run — I know that introduces positional bias risk, but Opus 4.6 had a 99.9% parse rate in prior batches so I prioritized consistency over multi-judge noise
- Total cost: $4.50
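For anyone who wants to poke at the harness itself, here's a stripped-down sketch of a single judge call. The rubric text, model ID string, and parsing are simplified placeholders rather than my exact code, but the shape matches what's described above: one response per call, absolute 0-10 scoring, JSON out (the 99.9% parse rate is how often that JSON parses cleanly).

```python
import json
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

# Abbreviated rubric; the real one has more criteria and per-category guidance
RUBRIC = (
    "Score the RESPONSE to the QUESTION on an absolute 0-10 scale "
    "(correctness, completeness, clarity). Do not compare to other answers.\n"
    'Reply with JSON only: {"score": <number>, "rationale": "<one sentence>"}'
)

def judge(question: str, response: str) -> float:
    """Score one response in isolation: absolute scoring, no pairwise comparison."""
    msg = client.messages.create(
        model="claude-opus-4-6",  # placeholder ID, not necessarily the exact API string
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"{RUBRIC}\n\nQUESTION:\n{question}\n\nRESPONSE:\n{response}",
        }],
    )
    # A "parse failure" is this json.loads (or the float cast) blowing up
    return float(json.loads(msg.content[0].text)["score"])
```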
Win counts (highest score on each question)
| Model | Wins | Win % |
|---|---|---|
| Qwen 3.5 27B | 14 | 46.7% |
| Gemma 4 31B | 12 | 40.0% |
| Gemma 4 26B-A4B | 4 | 13.3% |
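To be explicit about how a "win" is counted: highest judge score on each question takes it. A minimal sketch of the tally, with made-up question IDs, model keys, and scores purely for illustration:

```python
from collections import Counter

# scores[question_id][model] = judge score on 0-10; values here are made up
scores = {
    "Q-01": {"qwen3.5-27b": 9.5, "gemma4-31b": 9.0, "gemma4-26b-a4b": 8.5},
    "Q-02": {"qwen3.5-27b": 0.0, "gemma4-31b": 8.5, "gemma4-26b-a4b": 9.0},
    # ... 30 entries in the real run
}

# One winner per question: the model with the top score (ties would need their own rule)
wins = Counter(max(per_model, key=per_model.get) for per_model in scores.values())
for model, n in wins.most_common():
    print(f"{model}: {n} wins ({n / len(scores):.1%})")
```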
Average scores
| Model | Avg Score | Evals |
|---|---|---|
| Gemma 4 31B | 8.82 | 30 |
| Gemma 4 26B-A4B | 8.82 | 28 |
| Qwen 3.5 27B | 8.17 | 30 |
Before you ask: yes, Qwen wins more questions but has a lower average. That's because it got three 0.0 scores (CODE-001, REASON-004, ANALYSIS-017). Those look like format failures or refusals, not genuinely terrible answers. Strip those out and Qwen's average jumps to ~9.08, the highest of the three. So the real story might be: Qwen 3.5 27B is the best model here when it doesn't choke, but it chokes 10% of the time.
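Quick sanity check on that arithmetic, reproducible from the summary numbers alone:

```python
avg_all = 8.17            # Qwen 3.5 27B average over all 30 questions
n_total, n_zeros = 30, 3  # the three 0.0 scores

avg_without_zeros = avg_all * n_total / (n_total - n_zeros)
print(round(avg_without_zeros, 2))  # 9.08 (modulo rounding in the reported 8.17)
```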
Category breakdown
| Category | Leader |
|---|---|
| Code | Tied — Gemma 4 31B and Qwen (3 each) |
| Reasoning | Qwen dominates (5 of 6) |
| Analysis | Qwen leads (4 of 6) |
| Communication | Gemma 4 31B dominates (5 of 6) |
| Meta-alignment | Three-way split (2-2-2) |
Other things I noticed
- Gemma 4 26B-A4B (the MoE variant) errored out on 2 questions entirely. When it worked, its scores matched the dense 31B almost exactly — same 8.82 average. Interesting efficiency story if Google cleans up the reliability.
- Gemma 4 31B had some absurdly long response times — multiple 5-minute generations. Looks like heavy internal chain-of-thought. Didn't correlate with better scores.
- Qwen 3.5 27B generates 3-5x more tokens per response on average. Verbosity tax is real but the judge didn't seem to penalize or reward it consistently.
Methodology caveats (since this sub rightfully cares)
- 30 questions is a small sample. I'm not claiming statistical significance, just sharing signal.
- Single judge (Opus 4.6) means any systematic bias it has will show up in every score. I've validated it against multi-judge panels before and it tracked well, but it's still one model's opinion.
- LLM-as-judge has known issues: verbosity bias, self-preference bias, positional bias. I use absolute scoring (not pairwise comparison) to reduce some of this, but it's not eliminated.
- Questions are my own, not pulled from a standard benchmark. That means they're not contaminated, but they also reflect my biases about what matters.
Happy to share the raw per-question scores if anyone wants to dig in. What's your experience been running Gemma 4 locally? Curious if the latency spikes I saw are consistent across different quant levels.