Gemma 4 on Llama.cpp should be stable now

Reddit r/LocalLLaMA / 4/9/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • The author reports that, after a specific merge into llama.cpp (PR #21534), previously known Gemma 4 issues are resolved and Gemma 4 should now run more stably on the current llama.cpp source code (master).
  • They recommend running Gemma 4 31B with the interleaved chat template via `--chat-template-file`, pointing to the template file included in the llama.cpp repo under `models/templates`.
  • For reliability and performance, the post suggests using `--cache-ram 2048 -ctxcp 2` to reduce the risk of system RAM problems.
  • The author notes that using mixed-precision KV cache settings (Q5 K and Q4 V) has not shown major performance degradation in their testing, while acknowledging results may vary.
  • They caution builders not to use CUDA 13.2 because it is confirmed broken and can produce non-working builds, while NVIDIA is addressing the issue.

With the merging of https://github.com/ggml-org/llama.cpp/pull/21534, all known Gemma 4 issues in llama.cpp have been resolved. I've been running Gemma 4 31B on Q5 quants for some time now with no issues.

Runtime hints:

  • remember to pass `--chat-template-file` with the interleaved template Aldehir has prepared (it's in the llama.cpp repo under `models/templates`)
  • I strongly encourage running with `--cache-ram 2048 -ctxcp 2` to avoid system RAM problems
  • running KV cache with Q5 K and Q4 V has shown no large performance degradation, of course YMMV
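Putting the hints above together, a launch command might look something like the sketch below. The model filename, the template filename, and the exact KV cache type strings (`q5_1`/`q4_0` for "Q5 K and Q4 V") are my assumptions, not from the post — check your local `models/templates` directory and `llama-server --help` for the real names on your build.

```shell
# Sketch of a llama-server launch following the post's hints.
# Paths and filenames below are placeholders -- adjust to your setup.
llama-server \
  -m ./gemma-4-31b-Q5_K_M.gguf \                      # hypothetical quant filename
  --chat-template-file models/templates/<interleaved-template>.jinja \  # use the actual template file from the repo
  --cache-ram 2048 -ctxcp 2 \                         # flags as given in the post
  --cache-type-k q5_1 --cache-type-v q4_0             # assumed spellings for "Q5 K / Q4 V"
```

Note that `--cache-ram 2048` caps the prompt-cache RAM usage, which is what the author credits with avoiding system RAM problems.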

Have fun :)

(oh yeah, important remark - when I talk about llama.cpp here, I mean the *source code*, not the releases which lag behind - this refers to the code built from current master)

Important note about building: DO NOT currently use CUDA 13.2 as it is CONFIRMED BROKEN (the NVidia people are on the case already) and will generate builds that will not work correctly.

submitted by /u/ilintar