VRAM optimization for gemma 4

Reddit r/LocalLLaMA / 4/3/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • Gemma 4’s dense model can consume large VRAM up front due to the Sliding Window Attention (SWA) KV cache, which is allocated in F16 and may not be quantized like the rest of the KV cache.
  • A recent llama.cpp change briefly made the SWA portion unquantized even when KV cache quantization is enabled, but it was reverted shortly after, so users should run a recent build.
  • If you are running solo, adding `-np 1` to your llama.cpp launch command can cut SWA cache VRAM usage by roughly 3x (e.g., ~900MB → ~300MB on the 26B model and ~3200MB → ~1200MB on the 31B model).
  • Adjusting `-ub` (ubatch size) can significantly affect SWA buffer memory; keeping the default `-ub 512` is recommended rather than increasing it for speed when VRAM is limited.
  • On 16GB GPUs with the dense 31B model, viable operation may require lower quantization levels (e.g., IQ3/Q3_K) and reducing vision/mmproj memory to reach 30K+ context without OOM.

TLDR: add `-np 1` to your llama.cpp launch command if you are the only user; it cuts SWA cache VRAM by roughly 3x instantly.

So I was messing around with Gemma 4 and noticed the dense model hogs a massive chunk of VRAM before you even start generating anything. If you are on 16GB you might be hitting OOM and wondering why.

The culprit is the SWA (Sliding Window Attention) KV cache. It allocates in F16 and does not get quantized like the rest of the KV cache. A couple of days ago ggerganov merged a PR that accidentally made this worse by keeping the SWA portion unquantized even when you have KV cache quantization enabled. It got reverted about 2 hours later here https://github.com/ggml-org/llama.cpp/pull/21332, so make sure you are on a recent build.

A few things that actually help with VRAM:

The SWA cache size is calculated as roughly (sliding window size × number of parallel sequences) + micro batch size. So if your server is defaulting to 4 parallel slots, you are paying about 3x the memory compared to a single-user setup. Adding -np 1 to your launch command if you are just chatting solo cuts the SWA cache from around 900MB down to about 300MB on the 26B model, and from about 3200MB to 1200MB on the 31B dense model.
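The arithmetic above can be sketched in a few lines. This is just the post's rough formula turned into a function; the 1024-token sliding window is an assumed value (matching recent Gemma releases), not something stated in the post, and real allocations also depend on layer count, head dims, and element size.

```python
def swa_cache_tokens(window: int, n_parallel: int, ubatch: int) -> int:
    # Rough token budget per the post's formula:
    # (sliding window size x parallel sequences) + micro-batch size
    return window * n_parallel + ubatch

# Assumptions: 1024-token sliding window, llama.cpp's default ubatch of 512.
default_slots = swa_cache_tokens(1024, 4, 512)  # server default of 4 slots -> 4608 tokens
solo          = swa_cache_tokens(1024, 1, 512)  # with -np 1               -> 1536 tokens
print(default_slots / solo)  # -> 3.0, matching the ~900MB -> ~300MB drop
```

With these assumed numbers the ratio comes out to exactly 3x, which lines up with the ~900MB → ~300MB figure; the extra `+ ubatch` term is why it is not a clean 4x.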

Also watch out for -ub (ubatch size). The default is 512, and that is fine. If you or some guide told you to set -ub 4096 for speed, that bloats the SWA buffer massively. Just leave it at the default unless you have VRAM to burn.
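Putting the two flags together, a solo-chat launch line might look like the sketch below. The model filename and context size are placeholders I made up for illustration; only `-np 1` and `-ub 512` come from the post.

```shell
# Hypothetical solo-user llama-server launch (model path and -c value are placeholders)
#   -np 1  : one parallel slot, shrinks the SWA cache ~3x
#   -ub 512: keep the default micro-batch; larger values bloat the SWA buffer
CMD="llama-server -m gemma-4-31b-IQ3_M.gguf -c 32768 -np 1 -ub 512"
echo "$CMD"
```

Dropping `-np 1` (or letting a frontend set its own parallel-slot count) silently puts you back on the multi-slot allocation.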

On 16GB with the dense 31B model you can still run decent context with IQ3 or Q3_K quantization, but you will likely need to drop the mmproj (vision) to fit 30K+ context (fp16). With -np 1 and the default ubatch it becomes much more manageable.

submitted by /u/Sadman782