Tested Gemma 4 E2B across 10 enterprise task suites against Gemma 2 2B, Gemma 3 4B, Gemma 4 E4B, and Gemma 3 12B. All runs were local on Apple Silicon.
Overall ranking (9 evaluable suites):
- Gemma 4 E4B — 83.6%
- Gemma 3 12B — 82.3%
- Gemma 3 4B — 80.8%
- Gemma 4 E2B — 80.4% ← new entry
- Gemma 2 2B — 77.6%
Key E2B results:
- Multi-turn: 70% (highest in family — beats every larger sibling)
- Classification: 92.9% (tied with 4B and 12B)
- Info Extraction F1: 80.2% (matches 12B)
- Multilingual: 83.3%
- Safety: 93.3% (100% prompt injection resistance)
Same parameter count, generational improvement (Gemma 2 2B → Gemma 4 E2B):
- Multi-turn: 40% → 70% (+30)
- RAG grounding: 33.3% → 50% (+17)
- Function calling: 70% → 80% (+10)
7 of 8 suites improved at the same parameter count.
Function calling initially crashed our evaluator with TypeError: unhashable type: 'dict', because the model returned nested dicts where the evaluator expected strings. This is the third small-model evaluator bug I've found this year.
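
For anyone who hits the same crash: the post does not include the evaluator code, so this is only a guess at one plausible cause, not the actual implementation. A common pattern is to compare predicted and expected call arguments as a set of (key, value) pairs, which works while every value is a string and raises TypeError as soon as a value is a nested dict. The names naive_arg_set, canonical_arg_set, and args_match are hypothetical, and serializing values with json.dumps is just one way to make them hashable.

```python
import json


def naive_arg_set(args):
    # Hypothetical evaluator helper: builds a set of (key, value) pairs.
    # Set members must be hashable, so a dict or list value raises TypeError.
    return set(args.items())


def canonical_arg_set(args):
    # Assumed fix: serialize every value to a deterministic JSON string
    # so nested dicts and lists become hashable and comparable.
    return {(key, json.dumps(value, sort_keys=True)) for key, value in args.items()}


def args_match(predicted, expected):
    # Order-insensitive comparison that tolerates nested argument values.
    return canonical_arg_set(predicted) == canonical_arg_set(expected)


# Illustrative arguments only, not taken from the benchmark.
expected = {"city": "Paris", "units": "metric"}
nested = {"city": {"name": "Paris"}, "units": "metric"}

try:
    naive_arg_set(nested)
    crashed = False
except TypeError:
    crashed = True  # the nested dict cannot be hashed

# The canonical version scores the nested call as a mismatch instead of crashing.
nested_matches = args_match(nested, expected)
reordered_matches = args_match({"units": "metric", "city": "Paris"}, expected)
```

With this approach a malformed call counts as a wrong answer rather than taking down the whole run, which keeps the other suites evaluable.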

