The E4B model is performing very poorly in my tests, and since no one seems to be talking about it, I had to unlurk myself and post this. It's performing badly even compared to qwen3.5-4b. Can someone confirm or dis...uh...firm (?)
My test suite has roughly 100 vision-related tasks: single-turn, no tools, just an input image and a prompt, but with definitive answers (not all of them are VQA, though). Most of these tasks sit upstream of some kind of agentic use case.
To give a sense: some tests take screenshots as input from which specific text has to be extracted; in others the model has to perform some inference over the image (for example, geoguessing travel photos, or calculating the total cost of a grocery list from an image of the relevant supermarket display shelf with clearly visible price tags, etc.).
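For concreteness, a test case in my harness is shaped roughly like this (a minimal sketch; the class and field names are my own invention, not from any library or from the actual harness):

```python
from dataclasses import dataclass

@dataclass
class VisionTestCase:
    # Path to the single input image shown to the model.
    image_path: str
    # Single-turn prompt: no tools, no follow-up messages.
    prompt: str
    # Definitive expected answer; a tuple when partial credit applies.
    expected: tuple[str, ...]

# Hypothetical example mirroring the geoguessing task described below.
case = VisionTestCase(
    image_path="images/seagull.jpg",
    prompt="Which city and country was this photo taken in?",
    expected=("venice", "italy"),
)
```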
The first round was conducted on unsloth's and bartowski's Q8 quants using llama.cpp (b8680, with image-min-tokens set to 1120 as per the gemma-4 docs), and they performed so badly that I switched to the transformers library.
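For anyone trying to reproduce the llama.cpp round, the invocation looked roughly like this (a sketch, not my exact command: the model and mmproj filenames are placeholders, and --image-min-tokens is the flag I set per the docs):

```shell
# llama.cpp build b8680, Q8 quant from unsloth or bartowski.
# Filenames below are placeholders for whichever GGUF you downloaded.
llama-server \
  -m gemma-4-E4B-Q8_0.gguf \
  --mmproj mmproj-gemma-4-E4B.gguf \
  --image-min-tokens 1120
```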
The test results are:

- Qwen3.5-4b: 0.5 (the tests are calibrated such that a 4b model scores 0.5)
- Gemma-4-E4b: 0.27
Note: the evaluations are designed to give partial credit. For example, for the seagull image from the official HF Gemma 4 blog post, the acceptable answer is a 2-tuple: (venice, italy). E4B Q8 doesn't answer at all; via the transformers lib I get (rome, italy). Qwen3.5-4b gets this right (as do 9b models such as qwen3.5-9b and GLM 4.6v flash).
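The partial-credit logic is nothing fancy. A toy sketch of the idea (my own illustration, not the actual harness code): each element of the expected tuple that shows up in the answer earns an equal share of the credit, so (rome, italy) against (venice, italy) scores 0.5.

```python
def partial_credit(answer: str, expected: tuple[str, ...]) -> float:
    """Score an answer against an expected tuple.

    Each expected element found in the answer (case-insensitive
    substring match) earns an equal fraction of the full credit.
    """
    answer = answer.lower()
    hits = sum(1 for part in expected if part.lower() in answer)
    return hits / len(expected)

# Country right, city wrong -> half credit.
print(partial_credit("This looks like Rome, Italy.", ("venice", "italy")))  # 0.5
# Both right -> full credit.
print(partial_credit("Venice, Italy", ("venice", "italy")))  # 1.0
```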