Gemma-4 E4B model's vision seems to be surprisingly poor

Reddit r/LocalLLaMA / 4/7/2026

💬 Opinion · Signals & Early Trends · Models & Research

Key Points

  • A user reports that the Gemma-4 E4B vision model performs surprisingly poorly on a ~100-task suite of single-turn image understanding problems without tools.
  • In the reporter’s tests, Gemma-4-E4b scores 0.27 versus Qwen3.5-4b’s 0.5 on the same calibrated evaluation, indicating a substantial gap in vision capability.
  • The user tested quantized models (Q8) using both llama.cpp (with image-min-tokens set per Gemma-4 docs) and the Hugging Face transformers library, with the poor performance persisting.
  • They cite an example where the model fails to produce an answer in the expected structured format, while Qwen3.5-4b succeeds.
  • The post requests confirmation from others, implying uncertainty about whether the results reflect model limitations or evaluation/configuration issues.

The E4B model is performing very poorly in my tests, and since no one seems to be talking about it, I had to unlurk myself and post this. It's performing badly even compared to qwen3.5-4b. Can someone confirm or dis...uh...firm (?)

My test suite has roughly 100 vision-related tasks: single-turn with no tools, only an input image and a prompt, but with definitive answers (not all of them are VQA, though). Most of these tasks are upstream of any kind of agentic use case.

To give a sense: there are tests where the inputs are screenshots from which certain text information has to be extracted, and others are images on which the model has to perform some inference (for example: geoguessing on travel images, or calculating the total cost of a grocery list given an image of the relevant supermarket display shelf with clearly visible price tags, etc).

The first round was conducted on unsloth's and bartowski's Q8 quants using llama.cpp (b8680, with image-min-tokens set to 1120 as per the gemma-4 docs), and they performed so badly that I shifted to using the transformers library.
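For anyone who wants to reproduce the setup, the llama.cpp invocation was roughly along these lines (a sketch only: the model/mmproj filenames and port are placeholders, not my exact paths):

```shell
# Sketch of the llama-server run (b8680 build, Q8 quant).
# Filenames and port are placeholders; --image-min-tokens set per the gemma-4 docs.
./llama-server \
  -m gemma-4-e4b-it-Q8_0.gguf \
  --mmproj mmproj-gemma-4-e4b.gguf \
  --image-min-tokens 1120 \
  --port 8080
```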

The outcomes of the tests are:

  • Qwen3.5-4b: 0.5 (the tests are calibrated such that a 4b model scores 0.5)
  • Gemma-4-E4b: 0.27

Note: The test evaluations are designed to give partial credit. For example, for this image from the HF gemma 4 official blogpost: seagull, the acceptable answer is a 2-tuple: (venice, italy). E4B Q8 doesn't answer at all; if I use the transformers lib I get (rome, italy). Qwen3.5-4b gets this right (as do 9b models such as qwen3.5-9b and GLM 4.6v flash).

submitted by /u/specji