Get 30K more context using Q8 mmproj with Gemma 4

Reddit r/LocalLLaMA / 4/6/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage

Key Points

  • The post claims that using the Q8_0 mmproj for vision with Gemma 4 instead of F16 increases achievable context by about 30K without quality loss, and may even improve results in some tests.
  • It reports specific generation settings (--image-min-tokens 300 and --image-max-tokens 512) that help reach 60K+ total context while keeping vision enabled with an FP16 cache.
  • A Hugging Face link is provided to the specific Q8 mmproj file used for Gemma 4 26B.
  • The author notes an upcoming fix for a regression in post-b8660 llama.cpp builds, recommending users update llama.cpp after the fix is merged.
  • Overall, the guidance is framed as a practical optimization for local multimodal (vision-enabled) Gemma 4 runs to improve context length under limited hardware budgets.

Hey guys, quick follow-up to my post yesterday about running Gemma 4 26B.

I kept testing and realized you can just use the Q8_0 mmproj for vision instead of F16. There is no quality drop, and it actually performed a bit better in a few of my tests (with --image-min-tokens 300 --image-max-tokens 512). You can easily hit 60K+ total context with an FP16 cache and still keep vision enabled.

Here is the Q8 mmproj I used: https://huggingface.co/prithivMLmods/gemma-4-26B-A4B-it-F32-GGUF/blob/main/GGUF/gemma-4-26B-A4B-it.mmproj-q8_0.gguf
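For readers who want to try this, here is a sketch of what a launch command along these lines might look like. The `--image-min-tokens` and `--image-max-tokens` values come from the post; the model path, main-model quant, and the exact context size (`-c 61440`, i.e. 60K) are placeholder assumptions, not from the post:

```shell
# Hypothetical llama-server invocation; model file names are placeholders.
# The key change from the post: --mmproj points at the Q8_0 vision
# projector instead of the F16 one, which frees enough memory for a
# noticeably longer context while keeping vision enabled.
llama-server \
  -m ./gemma-4-26B-A4B-it-Q4_K_M.gguf \
  --mmproj ./gemma-4-26B-A4B-it.mmproj-q8_0.gguf \
  -c 61440 \
  --image-min-tokens 300 \
  --image-max-tokens 512
```

The image-token bounds cap how many tokens each image is encoded into, which is what keeps vision usable within the larger total context budget.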

Link to original post (and huge thanks to this comment for the tip!).

Quick heads up: regarding the regression in post-b8660 builds, a fix has already been approved and will be merged soon. Make sure to update llama.cpp after the merge.

submitted by /u/Sadman782