Benchmarked Gemma 4 E2B: The 2B model beat every larger sibling on multi-turn (70%)

Reddit r/LocalLLaMA / April 14, 2026

💬 Opinion · Signals & Early Trends · Models & Research

Key Points

  • A new benchmark of the Gemma 4 E2B model evaluated it on 10 enterprise task suites against larger Gemma variants, showing strong overall performance.
  • In the multi-turn category, Gemma 4 E2B achieved 70%, which the report claims is the highest within the family and beats every larger sibling.
  • Across other metrics, Gemma 4 E2B scored 92.9% on classification, 80.2% on info extraction F1, 83.3% on multilingual, and 93.3% on safety with 100% prompt-injection resistance.
  • When compared to the prior 2B generation, Gemma 4 E2B shows clear gains at the same parameter scale, including multi-turn (40%→70%), RAG grounding (33.3%→50%), and function calling (70%→80%).
  • The evaluator used in the test also encountered a function-calling-related crash due to nested dict outputs, highlighting practical tooling/evaluation issues for small models.

Tested Gemma 4 E2B across 10 enterprise task suites against Gemma 2 2B, Gemma 3 4B, Gemma 4 E4B, and Gemma 3 12B. Run locally on Apple Silicon.

Overall ranking (9 evaluable suites):

  • Gemma 4 E4B — 83.6%
  • Gemma 3 12B — 82.3%
  • Gemma 3 4B — 80.8%
  • Gemma 4 E2B — 80.4% ← new entry
  • Gemma 2 2B — 77.6%

Key E2B results:

  • Multi-turn: 70% (highest in family — beats every larger sibling)
  • Classification: 92.9% (tied with 4B and 12B)
  • Info Extraction F1: 80.2% (matches 12B)
  • Multilingual: 83.3%
  • Safety: 93.3% (100% prompt injection resistance)

Same parameter count, generational improvement (Gemma 2 2B → Gemma 4 E2B):

  • Multi-turn: 40% → 70% (+30)
  • RAG grounding: 33.3% → 50% (+17)
  • Function calling: 70% → 80% (+10)

7 of 8 suites improved at the same parameter count.

Function calling initially crashed our evaluator with TypeError: unhashable type: 'dict' — the model returned nested dicts where strings were expected. Third small-model evaluator bug I've found this year.
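The crash above is the classic Python pitfall of putting a dict where a hashable value is expected (e.g. as a set member or dict key). A minimal sketch of the failure mode and a defensive fix — the evaluator code here is hypothetical, not the OP's harness; it only assumes the evaluator indexes tool-call argument values somewhere that requires hashability:

```python
import json

def normalize_args(value):
    """Convert tool-call argument values to a hashable form:
    dicts and lists become canonical JSON strings, scalars pass through."""
    if isinstance(value, (dict, list)):
        return json.dumps(value, sort_keys=True)
    return value

# The model returned a nested dict where a string was expected:
tool_call = {"name": "get_weather",
             "arguments": {"location": {"city": "Paris"}}}

# Naive evaluator code that puts raw argument values in a set crashes:
try:
    seen = {tool_call["arguments"]["location"]}  # set members must be hashable
except TypeError as e:
    print(e)  # unhashable type: 'dict'

# Normalizing first keeps the evaluator alive:
seen = {normalize_args(tool_call["arguments"]["location"])}
print(seen)  # {'{"city": "Paris"}'}
```

Serializing with `sort_keys=True` also makes the comparison order-insensitive, so two structurally equal argument dicts dedupe correctly regardless of key order.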

submitted by /u/Zealousideal-Yard328