Gemma 4 E4B vs Qwen3.5-4B on document tasks: Qwen wins the benchmarks, but the sub-scores tell a different story

Reddit r/LocalLLaMA / 4/8/2026

💬 OpinionIdeas & Deep AnalysisTools & Practical UsageModels & Research

共有:

Key Points

Qwen3.5-4B outperforms Gemma 4 E4B on the IDP Leaderboard’s headline document benchmarks (OlmOCR, OmniDoc, and IDP Core).
Sub-scores reveal a different tradeoff: Gemma leads on raw OCR/text recognition, while Qwen strongly wins on structured extraction tasks like KIE (11.1 vs 86.0 in IDP Core).
For document understanding beyond pixel reading, Gemma appears weaker on compositional spatial reasoning (e.g., larger gaps on Multi-Col subsets in OlmOCR).
The comparison within Gemma shows that scaling from E2B to E4B yields sizable gains across benchmarks, suggesting configuration/size choice can materially change extraction performance.
Practical guidance from the report is that Qwen3.5-4B is preferable for end-to-end extraction pipelines at this model size, whereas Gemma may be better for OCR-heavy preprocessing before handing off to other models.

Gemma 4 E4B vs Qwen3.5-4B on document tasks: Qwen wins the benchmarks, but the sub-scores tell a different story

Results live here: https://www.idp-leaderboard.org/

Ran both through the IDP Leaderboard (OlmOCR Bench, OmniDocBench, IDP Core) and the headline numbers aren't the interesting part.

Top-line scores:

Benchmark	Gemma 4 E4B	Qwen3.5-4B
OlmOCR	47.0	75.4
OmniDoc	59.7	67.6
IDP Core	55.0	74.5

Qwen wins all three. On OlmOCR the gap is 28 points. Open and shut, right?

Not quite. Drill into IDP Core:

Sub-task	Gemma 4 E4B	Qwen3.5-4B
OCR (raw text recognition)	74.0	64.7
KIE (structured extraction)	11.1	86.0
Table	55.0	76.7
VQA	65.3	72.4

Gemma reads text from documents better than Qwen. It just can't do anything structured with what it reads. The KIE collapse (11.1 vs 86.0) isn't a vision failure, it's an instruction-following failure on schema-defined outputs (atleast thats what I'm guessing)

Same pattern in OlmOCR: Gemma scores 48.4 on H&F (handwriting/figures) vs Qwen's 47.2 essentially tied on the hardest visual subset. But Multi-Col is 37.1 vs 79.2. Multi-column layout needs compositional spatial reasoning, not just pixel-level reading.

Within the Gemma family, the E2B (2.3B effective) to E4B (4.5B effective) gap is steep: OlmOCR goes 38.2 → 47.0, OmniDoc 43.3 → 59.7. Worth knowing if you're considering the smaller variant.

Practical takeaways:

If you're running end-to-end extraction pipelines, Qwen3.5-4B is still the better pick at this size. But if you're preprocessing documents before passing to another model and you care about raw text fidelity over structured output, Gemma's perception quality is underrated.

Gemma might be actually better in handwriting recognition as thats what the OCR tasks resemble (Check this for example is one of the benchmark's OCR task: https://www.idp-leaderboard.org/explore/?model=Nanonets+OCR2%2B&benchmark=idp&task=OCR&sample=ocr\_handwriting\_3)

And lastly I felt Gemma is a reasoning powerhouse matching Qwen on VQA benchmark.

The other Gemma angle: E2B and E4B have native audio input baked into the model weights. No separate pipeline. For anyone building voice + document workflows at the edge, nothing else at this size does that.

One genuine problem right now: the 26B MoE variant is running ~11 tok/s vs Qwen 35B-A3B at 60+ tok/s on a 5060 Ti 16GB. Same hardware. The routing overhead is real. Dense 31B is more predictable (~18–25 tok/s on dual consumer GPUs), but the MoE speed gap is hard to ignore.

Anyone running these on real document workloads? Curious whether the KIE gap closes with structured prompting or if it's more fundamental.

submitted by /u/shhdwi
[link] [comments]