Gemma 4 fixes in llama.cpp

Reddit r/LocalLLaMA / 4/4/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • Users note that Gemma quality issues may be specific to implementations, and many problems reportedly disappear when using the Gemma model via llama.cpp rather than other transformer pipelines.
  • The article points to multiple recent llama.cpp pull requests that include fixes relevant to Gemma 4, implying the ecosystem often patches model behavior after release.
  • It highlights that such fixes typically take days to propagate into llama.cpp after a model is first released.
  • A personal test is described where chat looping issues occurred in one scenario but not others, suggesting prompt choice or usage context can significantly affect observed problems.
  • The post encourages readers to check ongoing llama.cpp updates, indicating there may be additional fixes beyond the listed PRs.

There have already been opinions that Gemma is bad because it doesn't work well, but those issues may be implementation-specific: you probably aren't using the transformers implementation, you're using llama.cpp.

After a model is released, you usually have to wait at least a few days for all the fixes to land in llama.cpp, for example:

https://github.com/ggml-org/llama.cpp/pull/21418

https://github.com/ggml-org/llama.cpp/pull/21390

https://github.com/ggml-org/llama.cpp/pull/21406

https://github.com/ggml-org/llama.cpp/pull/21327

https://github.com/ggml-org/llama.cpp/pull/21343

...and maybe there will be more?
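Picking up the fixes from PRs like the ones above comes down to updating and rebuilding llama.cpp from source. A minimal sketch, assuming an existing clone of the llama.cpp repository and a working CMake toolchain (backend flags and paths are placeholders, not from the post):

```shell
# Hedged sketch: update an existing llama.cpp checkout and rebuild,
# so that recently merged fixes reach your local binaries.
cd llama.cpp
git pull origin master            # fetch the latest merged PRs

cmake -B build                    # configure (add backend flags as needed)
cmake --build build --config Release -j

# Print the commit the binaries were built from, to confirm
# it is newer than the fixes you care about:
./build/bin/llama-cli --version
```

To check whether a specific PR made it into your build, you can compare that PR's merge commit against `git log` in your checkout before rebuilding.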

I had a looping problem in chat, but I also tried doing some tasks in OpenCode (it wasn't even coding), and there were zero problems. So, probably just like with GLM Flash, a better prompt somehow fixes the overthinking/looping.

submitted by /u/jacek2023