Gemma 4 vs Qwen 3.5 Benchmark Comparison

Reddit r/LocalLLaMA / 4/4/2026


Key Points

  • The article presents a side-by-side comparison of official benchmark results for Qwen 3.5 models and Gemma 4 models across multiple evaluation suites.
  • Across tests like MMLU-Pro and GPQA Diamond, larger Qwen and Gemma variants generally show higher performance, with Gemma leading in several high-level reasoning and academic benchmarks.
  • In coding-focused evaluations such as LiveCodeBench v6 and Codeforces ELO, the results vary by model size and variant, indicating no single winner across all coding metrics.
  • The comparison also covers tool-use settings (HLE-n without tools vs. HLE-t with tools), where the performance gaps between the families widen noticeably once tools are enabled.
  • Overall, the compiled table is positioned as a “neck-and-neck” view of relative strengths, helping readers judge tradeoffs between different model families and scales using consistent benchmarks.

I took the official benchmarks for Qwen 3.5 and Gemma 4 and compiled them into a neck-and-neck comparison here.

The Benchmark Table

| Benchmark | Qwen 2B | Gemma E2B | Qwen 4B | Gemma E4B | Qwen 27B | Gemma 31B | Qwen 35B (MoE) | Gemma 26B (MoE) |
|---|---|---|---|---|---|---|---|---|
| MMLU-Pro | 66.5% | 60.0% | 79.1% | 69.4% | 86.1% | 85.2% | 85.3% | 82.6% |
| GPQA Diamond | 51.6% | 43.4% | 76.2% | 58.6% | 85.5% | 84.3% | 84.2% | 82.3% |
| LiveCodeBench v6 | 69.4% | 44.0% | 55.8% | 52.0% | 80.7% | 80.0% | 74.6% | 77.1% |
| Codeforces ELO | N/A | 633 | 24.1 | 940 | 1899 | 2150 | 2028 | 1718 |
| TAU2-Bench | 48.8% | 24.5% | 79.9% | 42.2% | 79.0% | 76.9% | 81.2% | 68.2% |
| MMMLU (Multilingual) | 63.1% | 60.0% | 76.1% | 69.4% | 85.9% | 85.2% | 85.2% | 86.3% |
| HLE-n (No tools) | N/A | N/A | N/A | N/A | 24.3% | 19.5% | 22.4% | 8.7% |
| HLE-t (With tools) | N/A | N/A | N/A | N/A | 48.5% | 26.5% | 47.4% | 17.2% |
| AIME 2026 | N/A | N/A | N/A | 42.5% | N/A | 89.2% | N/A | 88.3% |
| MMMU Pro (Vision) | N/A | N/A | N/A | N/A | 75.0% | 76.9% | 75.1% | 73.8% |
| MATH-Vision | N/A | N/A | N/A | N/A | 86.0% | 85.6% | 83.9% | 82.4% |

(Note: N/A means official test data wasn't provided for that model size.)

Taken from the model cards of both providers.
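For anyone who wants to slice the table themselves, here's a short Python sketch (my own illustration, not from the post) that tallies head-to-head wins per size pairing. It treats higher as better on every row, including Codeforces ELO, and skips any matchup where either side is N/A:

```python
# Tally head-to-head benchmark wins for each Qwen/Gemma size pairing,
# using the figures from the comparison table above.
from collections import Counter

PAIRS = [
    ("Qwen 2B", "Gemma E2B"),
    ("Qwen 4B", "Gemma E4B"),
    ("Qwen 27B", "Gemma 31B"),
    ("Qwen 35B (MoE)", "Gemma 26B (MoE)"),
]

# benchmark -> scores in the table's column order (Qwen/Gemma interleaved);
# None marks an N/A cell.
SCORES = {
    "MMLU-Pro":           [66.5, 60.0, 79.1, 69.4, 86.1, 85.2, 85.3, 82.6],
    "GPQA Diamond":       [51.6, 43.4, 76.2, 58.6, 85.5, 84.3, 84.2, 82.3],
    "LiveCodeBench v6":   [69.4, 44.0, 55.8, 52.0, 80.7, 80.0, 74.6, 77.1],
    "Codeforces ELO":     [None, 633,  24.1, 940,  1899, 2150, 2028, 1718],
    "TAU2-Bench":         [48.8, 24.5, 79.9, 42.2, 79.0, 76.9, 81.2, 68.2],
    "MMMLU":              [63.1, 60.0, 76.1, 69.4, 85.9, 85.2, 85.2, 86.3],
    "HLE-n (No tools)":   [None, None, None, None, 24.3, 19.5, 22.4, 8.7],
    "HLE-t (With tools)": [None, None, None, None, 48.5, 26.5, 47.4, 17.2],
    "AIME 2026":          [None, None, None, 42.5, None, 89.2, None, 88.3],
    "MMMU Pro (Vision)":  [None, None, None, None, 75.0, 76.9, 75.1, 73.8],
    "MATH-Vision":        [None, None, None, None, 86.0, 85.6, 83.9, 82.4],
}

wins = Counter()
for bench, row in SCORES.items():
    for i, (qwen, gemma) in enumerate(PAIRS):
        q, g = row[2 * i], row[2 * i + 1]
        if q is None or g is None:
            continue  # N/A on either side: no comparison possible
        wins[qwen if q > g else gemma] += 1

for model, n in wins.most_common():
    print(f"{model}: {n} benchmark wins")
```

At these published numbers, the Qwen variant takes the majority of comparable rows in each pairing, with Gemma's wins concentrated in Codeforces ELO, MMMU Pro, and the multilingual/coding rows at the larger sizes.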

Sources:

  • https://qwen.ai/blog?id=qwen3.5
  • https://huggingface.co/Qwen/Qwen3.5-27B
  • https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/

submitted by /u/Fuzzy_Philosophy_606