Gemma 4 Vision

Reddit r/LocalLLaMA / 4/22/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Industry & Market Moves · Models & Research

Key Points

  • The post argues that many users do not configure Gemma 4 Vision’s “vision budget,” causing the default settings to be too low for OCR of tiny details.
  • Gemma 4 uses Variable Image Resolution with a default max vision budget of 280 (about 645K pixels), which the author says makes the model effectively “blind” for fine OCR tasks.
  • In llama.cpp, users can raise vision budget via --image-min-tokens and --image-max-tokens; the author reports better results with 560 and 2240 than with lower default-style values.
  • The author notes that increasing --image-max-tokens requires also increasing --batch-size and --ubatch-size, which significantly raises VRAM requirements (up to ~77GB in their example for q8_0 at max context).
  • They claim Ollama likely cannot benefit from these settings until an open issue is fixed, while still asserting that properly tuned Gemma 4 can be state-of-the-art for vision and OCR compared with several other models.

A lot of people in the Gemma 4 Model Request Thread were asking for better vision capabilities in the next Gemma Model. This tells me that people are not configuring Gemma 4's vision budget.

Gemma 4 ships with Variable Image Resolution. The default max vision budget is 280 (~645K pixels), which is far too low. In this mode, it fails to OCR tiny details. It's essentially blind in my books.

In llama.cpp, you can configure Gemma 4's vision budget with two parameters: --image-min-tokens and --image-max-tokens. The engine will try to fit the image within those bounds. I believe the defaults are 40 and 280 respectively. That's Gemma 4's default from Google's side, but it's way too low.

I like to run them at 560 and 2240 respectively and it's able to pick up very minute and hazy details within images.
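As a sketch, a llama.cpp server launch with that raised vision budget might look like the following. The model and mmproj file names here are placeholders (not from the post); the two --image-*-tokens flags are the ones described above.

```shell
# Hypothetical llama-server launch with a raised vision budget.
# Model/mmproj paths are placeholders -- substitute your own GGUF files.
# --image-min-tokens / --image-max-tokens bound Gemma 4's
# Variable Image Resolution, as described in the post.
llama-server \
  -m gemma-4-q8_0.gguf \
  --mmproj gemma-4-mmproj.gguf \
  --image-min-tokens 560 \
  --image-max-tokens 2240
```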

Why 2240, when that's double Google's stated max (1120)? In my testing, 2240 for some reason works better than 1120. I suspect this is down to llama.cpp's implementation, which tries to fit the image between the min and max token counts.
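For a rough sense of what these budgets mean in pixels, here is a back-of-envelope calculation using the post's own ratio (280 tokens ≈ 645K pixels). The per-token figure below is derived from that ratio, not an official number:

```shell
# Pixels per vision token, inferred from the post: 645,000 px / 280 tokens
echo $(( 645000 / 280 ))            # 2303 px per token (integer division)

# Approximate pixel budget at --image-max-tokens 2240
echo $(( 2240 * (645000 / 280) ))   # 5158720 px, roughly a 2270x2270 image
```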

Additionally, you will have to set --batch-size and --ubatch-size above whatever value you choose for --image-max-tokens. I run them at 4096 (for --image-max-tokens 2240). This consumes a lot more VRAM: from 63 GB (default batch) to 77 GB (4096 batch) for q8_0 at max context.
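Putting the batch settings together with the vision budget, a full invocation might look like this sketch (the model and mmproj file names are placeholders, not from the post):

```shell
# Hypothetical full launch. Per the post, --batch-size and --ubatch-size
# must exceed --image-max-tokens; 4096 covers the 2240-token vision
# budget with headroom, at the cost of significantly more VRAM.
llama-server \
  -m gemma-4-q8_0.gguf \
  --mmproj gemma-4-mmproj.gguf \
  --image-min-tokens 560 \
  --image-max-tokens 2240 \
  --batch-size 4096 \
  --ubatch-size 4096
```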

If you use Ollama, you are likely SOL unless and until they care to fix this.

It's worth it though. With a higher vision budget, Gemma 4 is pretty much SOTA for vision and pretty much destroys everything else, especially for OCR: Qwen 3.5, Qwen 3.6, GLM OCR (or any other random OCR model), Kimi K2.5. I haven't tested Kimi K2.6, and I refuse to touch cloud models.

submitted by /u/seamonn