Got Gemma 4 running locally on CUDA, both float and GGUF quantized, with benchmarks

Reddit r/LocalLLaMA / 4/7/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The post documents how the author got Gemma 4 running locally on CUDA in both BF16 full-precision and GGUF quantized (Q4_K_M) modes, including observed token-per-second benchmarks on an RTX 3090.
  • A central finding is that Gemma 4 uses attention_scale=1.0, which makes it far more sensitive to precision errors than typical transformer implementations and can yield silent, quickly-degrading outputs.
  • The author reports that several common optimizations can fail on Gemma 4 (e.g., F16 KV cache, certain fused attention kernels, and Flash Attention v1 with head_dim=512), leading to divergence or all-zero logits.
  • As a workaround, they recommend avoiding any dtype conversion at the KV cache boundary and keeping model-weight precision aligned with KV cache precision (BF16→BF16 KV cache; F32 GGUF→F32 KV cache), while using F32 for internal attention math.
  • Additional notes highlight architectural “hybrid attention” constraints (standard SDPA/FlashAttention incompatibilities at specific head dimensions), KV cache sharing saving ~57% memory, and an assessment that Gemma 4’s design is notably different from common LLaMA-like variants.

Spent the last week getting Gemma 4 working on CUDA with both full-precision (BF16) and GGUF quantized inference. Here's a video of it running. Sharing some findings because this model has some quirks that aren't obvious.

Performance (Gemma4 E2B, RTX 3090):

| Config | BF16 Float | Q4_K_M GGUF |
|-------------------------|------------|-------------|
| short gen (p=1, g=32) | 110 tok/s | 170 tok/s |
| long gen (p=512, g=128) | 72 tok/s | 93 tok/s |

The precision trap nobody warns you about

Honestly, making it work was harder than I thought.

Gemma 4 uses attention_scale=1.0 (QK-norm instead of the usual 1/sqrt(d_k) scaling). This makes it roughly 22x more sensitive to precision errors than standard transformers. Things that work fine on LLaMA or Qwen will silently produce garbage on Gemma 4:

  • F16 KV cache? Precision loss compounds across decode steps and output degenerates after ~50 tokens
  • Fused attention kernels? Token divergence after ~4 steps
  • Flash attention v1 with head_dim=512? All-zero logits (kernel bug)
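The ~22x factor falls straight out of the missing scale. Here's a rough numpy sketch (shapes and head_dim are illustrative, not Gemma's real config) showing that the same QK^T rounding error reaches the softmax about sqrt(512) ≈ 22.6x larger when attention_scale=1.0:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512                      # Gemma-style head_dim (illustrative)
q = rng.standard_normal((8, d))
k = rng.standard_normal((64, d))

exact = q @ k.T              # reference scores in float64
f16 = (q.astype(np.float16) @ k.astype(np.float16).T).astype(np.float64)
raw_err = np.max(np.abs(f16 - exact))

# the rounding error in the raw QK^T scores is fixed; the attention scale
# decides how much of it reaches the softmax
print(f"score error with scale=1/sqrt(d): {raw_err / np.sqrt(d):.5f}")
print(f"score error with scale=1.0:      {raw_err:.5f}")
print(f"amplification: {np.sqrt(d):.1f}x")   # sqrt(512) ≈ 22.6
```

With the usual 1/sqrt(d_k) scaling, half-precision noise in the scores gets shrunk before the softmax; with scale=1.0 it hits the softmax at full size, which is why F16 anywhere in the path compounds so fast.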

The rule I landed on: no dtype conversion at the KV cache boundary. BF16 model = BF16 KV cache with F32 internal attention math. F32 GGUF = F32 KV cache. Mixing dtypes between model weights and cache is where things break.
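A minimal numpy sketch of that rule (helper names are mine, and float32 stands in for bfloat16 since numpy has no bf16): storage stays at the weight dtype, and the math always upcasts to F32 internally:

```python
import numpy as np

def kv_cache_dtype(weight_dtype):
    # the rule: cache dtype mirrors the weight dtype exactly --
    # no "F16 is probably fine" downcast at the cache boundary
    return weight_dtype

def attend(q, k_cache, v_cache):
    # internal attention math in float32 regardless of storage dtype
    q32, k32, v32 = (x.astype(np.float32) for x in (q, k_cache, v_cache))
    scores = q32 @ k32.T                          # attention_scale = 1.0, no 1/sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # stable softmax
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v32
```

The point is that the only dtype conversion anywhere is the upcast inside `attend`; nothing ever gets written back to the cache at a lower precision than the weights.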

Once I got the precision right, output matches Python transformers token-for-token (verified first 30 tokens against HF fixtures).
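For anyone running the same check, a small helper like this (hypothetical, not from my actual harness) makes it easy to spot exactly where your output first diverges from the HF fixture:

```python
def first_divergence(a, b, n=30):
    """Index of the first mismatching token id in the first n, or None if they agree."""
    for i, (x, y) in enumerate(zip(a[:n], b[:n])):
        if x != y:
            return i
    return None
```

The divergence index is surprisingly diagnostic: a mismatch at token 0 usually means a weight-loading or tokenizer bug, while a mismatch after a few dozen steps points at compounding precision loss in the KV cache.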

Other things worth knowing:

  • The hybrid attention (sliding-window local + full global with head_dim=512) means you can't just drop in standard SDPA: Metal's SDPA caps at head_dim=256, and Flash Attention v1 has a kernel bug at 512
  • KV cache sharing across the last N layers saves ~57% KV memory, nice for fitting on consumer cards
  • The architecture is genuinely novel (dual RoPE configs, per-layer embeddings, sandwich norms), not just another LLaMA variant, which is cool. Still, I wish the usual 1/sqrt(d_k) attention scaling were there so precision weren't such an issue
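The ~57% figure is just the sharing arithmetic. With made-up layer counts (these are not Gemma 4's real config, just numbers that land near 57%):

```python
# if the last `shared` layers reuse one layer's KV instead of storing their own,
# the cache shrinks from `layers` entries to `layers - shared + 1`
def kv_saving(layers, shared):
    return 1 - (layers - shared + 1) / layers

print(f"{kv_saving(30, 18):.0%} of KV memory saved")  # hypothetical: 30 layers, last 18 sharing
```

At long prompt lengths the KV cache dominates VRAM, so a saving of that size is the difference between fitting and not fitting on a 24 GB card.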

Anyone else running Gemma 4 locally? Curious if others hit the same precision issues or found workarounds I missed.

https://reddit.com/link/1sebwz2/video/9zbou0jvzmtg1/player

submitted by /u/_w4nderlust_