Gemma 4 vs Qwen 3.5 Benchmark Comparison

Reddit r/LocalLLaMA / 4/4/2026


Key Points

  • The article presents a side-by-side comparison of official benchmark results for Qwen 3.5 models and Gemma 4 models across multiple evaluation suites.
  • Across tests like MMLU-Pro and GPQA Diamond, larger Qwen and Gemma variants generally show higher performance, with Gemma leading in several high-level reasoning and academic benchmarks.
  • In coding-focused evaluations such as LiveCodeBench v6 and Codeforces ELO, the results vary by model size and variant, indicating no single winner across all coding metrics.
  • The comparison also covers tool-use settings (HLE-n without tools vs. HLE-t with tools), where the performance gaps between the families widen noticeably once tools are enabled.
  • Overall, the compiled table is positioned as a “neck-and-neck” view of relative strengths, helping readers judge tradeoffs between different model families and scales using consistent benchmarks.

I took the official benchmarks for Qwen 3.5 and Gemma 4 and compiled them into a neck-and-neck comparison here.

The Benchmark Table

| Benchmark | Qwen 2B | Gemma E2B | Qwen 4B | Gemma E4B | Qwen 27B | Gemma 31B | Qwen 35B (MoE) | Gemma 26B (MoE) |
|---|---|---|---|---|---|---|---|---|
| MMLU-Pro | 66.5% | 60.0% | 79.1% | 69.4% | 86.1% | 85.2% | 85.3% | 82.6% |
| GPQA Diamond | 51.6% | 43.4% | 76.2% | 58.6% | 85.5% | 84.3% | 84.2% | 82.3% |
| LiveCodeBench v6 | 69.4% | 44.0% | 55.8% | 52.0% | 80.7% | 80.0% | 74.6% | 77.1% |
| Codeforces ELO | N/A | 633 | 24.1 | 940 | 1899 | 2150 | 2028 | 1718 |
| TAU2-Bench | 48.8% | 24.5% | 79.9% | 42.2% | 79.0% | 76.9% | 81.2% | 68.2% |
| MMMLU (Multilingual) | 63.1% | 60.0% | 76.1% | 69.4% | 85.9% | 85.2% | 85.2% | 86.3% |
| HLE-n (No tools) | N/A | N/A | N/A | N/A | 24.3% | 19.5% | 22.4% | 8.7% |
| HLE-t (With tools) | N/A | N/A | N/A | N/A | 48.5% | 26.5% | 47.4% | 17.2% |
| AIME 2026 | N/A | N/A | N/A | 42.5% | N/A | 89.2% | N/A | 88.3% |
| MMMU Pro (Vision) | N/A | N/A | N/A | N/A | 75.0% | 76.9% | 75.1% | 73.8% |
| MATH-Vision | N/A | N/A | N/A | N/A | 86.0% | 85.6% | 83.9% | 82.4% |

(Note: N/A means official test data wasn't provided for that model size.)

Taken from the model cards of both providers.
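For anyone who wants to slice the table themselves, here's a short Python sketch (my own illustration, not from the post) that tallies head-to-head wins per size pairing. It treats higher as better on every row, including Codeforces ELO, and skips any matchup where either side is N/A:

```python
# Tally head-to-head benchmark wins for each Qwen/Gemma size pairing,
# using the figures from the comparison table above.
from collections import Counter

PAIRS = [
    ("Qwen 2B", "Gemma E2B"),
    ("Qwen 4B", "Gemma E4B"),
    ("Qwen 27B", "Gemma 31B"),
    ("Qwen 35B (MoE)", "Gemma 26B (MoE)"),
]

# benchmark -> scores in the table's column order (Qwen/Gemma interleaved);
# None marks an N/A cell.
SCORES = {
    "MMLU-Pro":           [66.5, 60.0, 79.1, 69.4, 86.1, 85.2, 85.3, 82.6],
    "GPQA Diamond":       [51.6, 43.4, 76.2, 58.6, 85.5, 84.3, 84.2, 82.3],
    "LiveCodeBench v6":   [69.4, 44.0, 55.8, 52.0, 80.7, 80.0, 74.6, 77.1],
    "Codeforces ELO":     [None, 633,  24.1, 940,  1899, 2150, 2028, 1718],
    "TAU2-Bench":         [48.8, 24.5, 79.9, 42.2, 79.0, 76.9, 81.2, 68.2],
    "MMMLU":              [63.1, 60.0, 76.1, 69.4, 85.9, 85.2, 85.2, 86.3],
    "HLE-n (No tools)":   [None, None, None, None, 24.3, 19.5, 22.4, 8.7],
    "HLE-t (With tools)": [None, None, None, None, 48.5, 26.5, 47.4, 17.2],
    "AIME 2026":          [None, None, None, 42.5, None, 89.2, None, 88.3],
    "MMMU Pro (Vision)":  [None, None, None, None, 75.0, 76.9, 75.1, 73.8],
    "MATH-Vision":        [None, None, None, None, 86.0, 85.6, 83.9, 82.4],
}

wins = Counter()
for bench, row in SCORES.items():
    for i, (qwen, gemma) in enumerate(PAIRS):
        q, g = row[2 * i], row[2 * i + 1]
        if q is None or g is None:
            continue  # N/A on either side: no comparison possible
        wins[qwen if q > g else gemma] += 1

for model, n in wins.most_common():
    print(f"{model}: {n} benchmark wins")
```

At these published numbers, the Qwen variant takes the majority of comparable rows in each pairing, with Gemma's wins concentrated in Codeforces ELO, MMMU Pro, and the multilingual/coding rows at the larger sizes.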

Sources:

  • https://qwen.ai/blog?id=qwen3.5
  • https://huggingface.co/Qwen/Qwen3.5-27B
  • https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/

submitted by /u/Fuzzy_Philosophy_606