With the merging of https://github.com/ggml-org/llama.cpp/pull/21534, all of the fixes to known Gemma 4 issues in Llama.cpp have been resolved. I've been running Gemma 4 31B on Q5 quants for some time now with no issues.
Runtime hints:
- remember to run with `--chat-template-file` with the interleaved template Aldehir has prepared (it's in the llama.cpp code under models/templates)
- I strongly encourage running with `--cache-ram 2048 -ctxcp 2` to avoid system RAM problems
- running KV cache with Q5 K and Q4 V has shown no large performance degradation, of course YMMV
Have fun :)
(oh yeah, important remark - when I talk about llama.cpp here, I mean the *source code*, not the releases which lag behind - this refers to the code built from current master)
Important note about building: DO NOT currently use CUDA 13.2 as it is CONFIRMED BROKEN (the NVidia people are on the case already) and will generate builds that will not work correctly.
[link] [comments]



