Benchmarked Gemma 4 E2B: The 2B model beat every larger sibling on multi-turn (70%)

Reddit r/LocalLLaMA / 4/14/2026

Key Points

A new benchmark evaluated Gemma 4 E2B on 10 enterprise task suites against four other Gemma models (Gemma 2 2B, Gemma 3 4B, Gemma 4 E4B, and Gemma 3 12B). It placed fourth of five overall at 80.4%, within about 3 points of the top score.
In the multi-turn category, Gemma 4 E2B achieved 70%, which the report claims is the highest within the family and beats every larger sibling.
Across other metrics, Gemma 4 E2B scored 92.9% on classification, 80.2% on info extraction F1, 83.3% on multilingual, and 93.3% on safety with 100% prompt-injection resistance.
Compared with Gemma 2 2B, Gemma 4 E2B shows clear gains at the same parameter scale, including multi-turn (40%→70%), RAG grounding (33.3%→50%), and function calling (70%→80%).
The function-calling suite initially crashed the evaluator with TypeError: unhashable type: 'dict' because the model returned nested dicts where strings were expected, a reminder that evaluation tooling for small models has its own bugs.

Tested Gemma 4 E2B across 10 enterprise task suites against Gemma 2 2B, Gemma 3 4B, Gemma 4 E4B, and Gemma 3 12B. Run locally on Apple Silicon.

Overall ranking (9 evaluable suites):

Gemma 4 E4B — 83.6%
Gemma 3 12B — 82.3%
Gemma 3 4B — 80.8%
Gemma 4 E2B — 80.4% ← new entry
Gemma 2 2B — 77.6%

Key E2B results:

Multi-turn: 70% (highest in family — beats every larger sibling)
Classification: 92.9% (tied with 4B and 12B)
Info Extraction F1: 80.2% (matches 12B)
Multilingual: 83.3%
Safety: 93.3% (100% prompt injection resistance)

Same parameter count, generational improvement (Gemma 2 2B → Gemma 4 E2B):

Multi-turn: 40% → 70% (+30 pts)
RAG grounding: 33.3% → 50% (+16.7 pts)
Function calling: 70% → 80% (+10 pts)

7 of 8 suites improved at the same parameter count.
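
The gains listed above are percentage-point differences, not relative improvements. A minimal sketch of the arithmetic, using only the three score pairs quoted in this post (the helper name `deltas` is made up for illustration):

```python
# Score pairs quoted in the post: (Gemma 2 2B, Gemma 4 E2B), in percent.
scores = {
    "Multi-turn": (40.0, 70.0),
    "RAG grounding": (33.3, 50.0),
    "Function calling": (70.0, 80.0),
}


def deltas(old, new):
    """Return (percentage-point gain, relative gain in percent)."""
    points = new - old
    relative = points / old * 100
    return round(points, 1), round(relative, 1)


results = {suite: deltas(old, new) for suite, (old, new) in scores.items()}

for suite, (points, relative) in results.items():
    print(f"{suite}: +{points} pts ({relative}% relative)")
```

Read this way, the multi-turn jump is 30 points in absolute terms and a 75% relative increase over the Gemma 2 2B score.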

Function calling initially crashed our evaluator with TypeError: unhashable type: 'dict' — the model returned nested dicts where strings were expected. Third small-model evaluator bug I've found this year.
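
The evaluator code isn't included in the post, so this is only a guess at the failure mode: a scorer that puts argument values into a set (or uses them as dict keys) will raise this TypeError as soon as a value is itself a dict. The sketch below reproduces that and shows one possible guard, converting arguments into a hashable canonical form before comparing. The call format, `canonical_args`, and `call_key` are all assumptions for illustration, not the actual evaluator.

```python
def canonical_args(value):
    """Recursively turn dicts and lists into hashable, order-stable tuples."""
    if isinstance(value, dict):
        return tuple(sorted((str(k), canonical_args(v)) for k, v in value.items()))
    if isinstance(value, (list, tuple)):
        return tuple(canonical_args(v) for v in value)
    return value


def call_key(call):
    """Hashable key for a function call: (name, canonical arguments)."""
    return (call["name"], canonical_args(call.get("arguments", {})))


# Hypothetical reference call and a model output with a nested dict
# where a plain string was expected.
expected = {"name": "get_weather", "arguments": {"city": "Paris"}}
predicted = {"name": "get_weather", "arguments": {"city": {"name": "Paris"}}}

# Naive comparison: building a set of argument values fails on the nested dict.
try:
    set(predicted["arguments"].values())
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Guarded comparison: the nested output is scored as a mismatch, not a crash.
is_match = call_key(predicted) == call_key(expected)
print(error_message)
print("match:", is_match)
```

With a guard like this, a malformed call counts against the model's function-calling score instead of aborting the whole suite.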

submitted by /u/Zealousideal-Yard328