Gemma 4 on Llama.cpp should be stable now

Reddit r/LocalLLaMA / 4/9/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • The author reports that, after a specific merge into llama.cpp (PR #21534), previously known Gemma 4 issues are resolved and Gemma 4 should now run more stably on the current llama.cpp source code (master).
  • They recommend running Gemma 4 31B with the interleaved chat template via `--chat-template-file`, pointing to the template file included in the llama.cpp repo under `models/templates`.
  • For reliability and performance, the post suggests using `--cache-ram 2048 -ctxcp 2` to reduce the risk of system RAM problems.
  • The author notes that using mixed-precision KV cache settings (Q5 K and Q4 V) has not shown major performance degradation in their testing, while acknowledging results may vary.
  • They caution builders not to use CUDA 13.2 because it is confirmed broken and can produce non-working builds, while NVIDIA is addressing the issue.

With the merging of https://github.com/ggml-org/llama.cpp/pull/21534, all known Gemma 4 issues in llama.cpp have been resolved. I've been running Gemma 4 31B on Q5 quants for some time now with no issues.

Runtime hints:

  • remember to pass `--chat-template-file` with the interleaved template Aldehir has prepared (it's in the llama.cpp repo under `models/templates`)
  • I strongly encourage running with `--cache-ram 2048 -ctxcp 2` to avoid system RAM problems
  • running KV cache with Q5 K and Q4 V has shown no large performance degradation, of course YMMV
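Putting the hints above together, a launch command might look something like the sketch below. The model filename, the template filename, and the exact KV cache type strings (`q5_1`/`q4_0` for "Q5 K and Q4 V") are my assumptions, not from the post — check your local `models/templates` directory and `llama-server --help` for the real names on your build.

```shell
# Sketch of a llama-server launch following the post's hints.
# Paths and filenames below are placeholders -- adjust to your setup.
llama-server \
  -m ./gemma-4-31b-Q5_K_M.gguf \                      # hypothetical quant filename
  --chat-template-file models/templates/<interleaved-template>.jinja \  # use the actual template file from the repo
  --cache-ram 2048 -ctxcp 2 \                         # flags as given in the post
  --cache-type-k q5_1 --cache-type-v q4_0             # assumed spellings for "Q5 K / Q4 V"
```

Note that `--cache-ram 2048` caps the prompt-cache RAM usage, which is what the author credits with avoiding system RAM problems.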

Have fun :)

(oh yeah, important remark - when I talk about llama.cpp here, I mean the *source code*, not the releases which lag behind - this refers to the code built from current master)

Important note about building: DO NOT currently use CUDA 13.2 as it is CONFIRMED BROKEN (the NVidia people are on the case already) and will generate builds that will not work correctly.

submitted by /u/ilintar