Gemma 4 is efficient with thinking tokens, but it will also happily reason for 10+ minutes if you prompt it to do so.

Reddit r/LocalLLaMA / 4/3/2026

💬 Opinion · Signals & Early Trends · Ideas & Deep Analysis · Models & Research

Key Points

  • Gemma 4 (26B MoE and 31B dense) was tested in Google AI Studio on a cipher-cracking task that even top closed models solve only at maximum thinking settings; after brief initial "thinking," both variants hallucinated incorrect translations.
  • When prompted to "spare no effort," increase thinking to maximum, and double-check to rule out hallucinations, both Gemma 4 variants shifted to very long reasoning — up to ~10 minutes — before either erroring out or concluding that no valid solution could be found, without hallucinating an answer.
  • The 26B MoE model reasoned for about 10 minutes but errored out (likely due to a platform response cutoff), while the 31B dense model nearly reached 10 minutes and ultimately refused to produce an answer rather than hallucinate.
  • The author concludes that Gemma can be made to perform long-form reasoning when explicitly requested, even if it doesn’t do so by default, and that prompting may reduce hallucination (though more testing is needed).
  • The post suggests follow-up evaluation, including local tests of smaller Gemma models and a comparison against Qwen 3.5, to see whether longer reasoning lets Gemma close its benchmark gap or surpass Qwen.

Tested both 26b and 31b in AI Studio.

The task I asked of it was to crack a cypher. The top closed source models can crack this cypher at max thinking parameters, and Kimi 2.5 Thinking and Deepseek 3.2 are the only open source models to crack the cypher without tool use. (Of course, with the closed models you can't rule out 'secret' tool use on the backend.)

When I first asked these models to crack the cypher, they thought for a short amount of time and then both hallucinated false 'translations' of the cypher.

I added this to my prompt:

Spare no effort to solve this, the stakes are high. Increase your thinking length to maximum in order to solve it. Double check and verify your results to rule out hallucination of an incorrect response.

I did not expect dramatic results (we all laugh at prompting a model to 'make no mistakes' after all). But I was surprised at the result.

The 26B MoE model reasoned for ten minutes before erroring out (I am supposing AI Studio cuts off responses after ten minutes).

The 31B dense model reasoned for just under ten minutes (594 seconds in fact) before throwing in the towel and admitting it couldn't crack it. But most importantly, it did not hallucinate a false answer, which is a 'win' IMO. Part of its reply:

The message likely follows a directive or a set of coordinates, but without the key to resolve the "BB" and "QQ" anomalies, any further translation would be a hallucination.

I honestly didn't expect these (relatively) small models to actually crack the cypher without tool use (well, I hoped, a little). It was mostly a test to see how they'd perform.

I'm surprised to report that:

  • they can and will do very long form reasoning like Qwen, but only if asked, which is how I prefer things (Qwen tends to overthink by default, and you have to prompt it in the opposite direction). Some models (GPT, Gemini, Claude) allow you to set thinking levels/budgets/effort/whatever via parameters, but with Gemma it seems you can simply ask.

  • it's maybe possible to reduce hallucination via prompting - more testing required here.

I'll be testing the smaller models locally once the dust clears and the inevitable new release bugs are ironed out.

I'd love to know what sort of prompt these models are given on official benchmarks. Right now Gemma 4 is a little behind Qwen 3.5 in benchmarks (when comparing similarly sized models), but could it catch up to or surpass Qwen when prompted to reason longer (like Qwen does)? If so, then that's a big win.

submitted by /u/AnticitizenPrime