Tested both 26b and 31b in AI Studio.
The task I set them was to crack a cypher. The top closed-source models can crack this cypher at maximum thinking settings, and Kimi 2.5 Thinking and Deepseek 3.2 are the only open-source models to crack it without tool use. (Of course, with the closed models you can't rule out 'secret' tool use on the backend.)
When I first asked these models to crack the cypher, both thought for a short time and then hallucinated false 'translations' of the cypher.
I added this to my prompt:
> Spare no effort to solve this, the stakes are high. Increase your thinking length to maximum in order to solve it. Double check and verify your results to rule out hallucination of an incorrect response.
I did not expect dramatic results (we all laugh at prompting a model to 'make no mistakes', after all), but I was surprised by the outcome.
The 26B MoE model reasoned for ten minutes before erroring out (I assume AI Studio cuts off responses after ten minutes).
The 31B dense model reasoned for just under ten minutes (594 seconds, in fact) before throwing in the towel and admitting it couldn't crack it. Most importantly, it did not hallucinate a false answer, which is a 'win' in my book. Part of its reply:
> The message likely follows a directive or a set of coordinates, but without the key to resolve the "BB" and "QQ" anomalies, any further translation would be a hallucination.
I honestly didn't expect these (relatively) small models to actually crack the cypher without tool use (well, I hoped, a little). It was mostly a test to see how they'd perform.
I'm surprised to report that:

- They can and will do very long-form reasoning like Qwen, but only if asked, which is how I prefer things (Qwen tends to overthink by default, and you have to prompt it in the opposite direction). Some models (GPT, Gemini, Claude) let you set thinking levels/budgets/effort via API parameters, but with Gemma it seems you can simply ask.
- It may be possible to reduce hallucination via prompting - more testing required here.
I'll be testing the smaller models locally once the dust clears and the inevitable new release bugs are ironed out.
I'd love to know what sort of prompts these models are given on official benchmarks. Right now Gemma 4 is a little behind Qwen 3.5 in benchmarks (comparing similarly sized models), but could it catch up to or surpass Qwen when prompted to reason longer (as Qwen does by default)? If so, that's a big win.