Gemma 4 Vision

Reddit r/LocalLLaMA / 2026/4/22

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Industry & Market Moves · Models & Research

Key points

  • The poster argues that Gemma 4 Vision's "vision budget" is usually left unconfigured, so its low default makes it unsuitable for fine-grained OCR.
  • Gemma 4 uses Variable Image Resolution; the default max vision budget is 280 (~645K pixels), which the poster says leaves the model effectively "blind" to fine OCR detail.
  • In llama.cpp the vision budget can be adjusted with --image-min-tokens and --image-max-tokens; the poster reports that 560 and 2240 pick up minute, hazy details better than lower settings.
  • Raising --image-max-tokens also requires raising --batch-size and --ubatch-size, which adds significant VRAM (up to roughly 77 GB in the poster's example, at q8_0 with max context).
  • The poster notes that Ollama users may see no benefit until (and unless) this is fixed there, but that a properly tuned Gemma 4 can be SOTA-class for vision and OCR compared with other models.

A lot of people in the Gemma 4 Model Request Thread were asking for better vision capabilities in the next Gemma Model. This tells me that people are not configuring Gemma 4's vision budget.

Gemma 4 ships with Variable Image Resolution. The default max vision budget is 280 (~645K pixels), which is far too low. In this mode, it fails to OCR tiny details. It's essentially blind in my book.
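For a rough sense of scale (my own back-of-the-envelope estimate, not an official figure): 645,000 pixels spread across 280 vision tokens works out to about 2,300 pixels per token, so if each token roughly corresponds to one image patch, that's on the order of a 48×48-pixel tile per token. Small or faint glyphs end up summarized by just a handful of tokens, which fits the OCR failures described above.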

In llama.cpp, you can configure Gemma 4's vision budget with two parameters: --image-min-tokens and --image-max-tokens. The engine will try to fit the image within those bounds. I believe the defaults are 40 and 280 respectively. That matches Gemma 4's default from Google's side, but it's far too low.

I like to run them at 560 and 2240 respectively, and with those settings it picks up very minute, hazy details within images.

Why 2240? Isn't that double the max from Google (1120)? In my testing, 2240 for some reason works better than 1120. I suspect this is down to llama.cpp's implementation, where it tries to fit the image between the min and max token counts.

Additionally, you will also have to set --batch-size and --ubatch-size above whatever value you choose for --image-max-tokens. I run them at 4096 (for --image-max-tokens 2240). This will consume a lot more VRAM: from 63 GB at the default to 77 GB with a 4096 batch size, for q8_0 at max context.
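For concreteness, the launch looks something like this (the Gemma 4 GGUF and mmproj filenames are placeholders for whatever files you're using; only the flags are the point):

```bash
# Minimal sketch of the settings described above; the model and mmproj
# filenames are placeholders, not real release names.
llama-server \
  -m ./gemma-4-q8_0.gguf \
  --mmproj ./gemma-4-mmproj-f16.gguf \
  --image-min-tokens 560 \
  --image-max-tokens 2240 \
  --batch-size 4096 \
  --ubatch-size 4096
```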

If you use Ollama, you are likely SOL until (and unless) they care to fix this.

It's worth it, though. With a higher vision budget, Gemma 4 is pretty much SOTA for vision and destroys anything else, especially for OCR: Qwen 3.5, Qwen 3.6, GLM OCR (or any other random OCR model), Kimi K2.5. I haven't tested Kimi K2.6, and I refuse to touch cloud models.

submitted by /u/seamonn