Smaller models are getting scary good.

Reddit r/LocalLLaMA / 4/4/2026

💬 Opinion · Signals & Early Trends · Ideas & Deep Analysis · Models & Research

Key Points

  • A Reddit user describes an experiment where Gemini 3 Pro Deepthink produces a highly structured solution to an “unwinnable paradox” security puzzle, taking about 15 minutes to reason.
  • The user reports that a smaller open-weight model, Gemma 4 (31B) with tools enabled, identifies physical-constraint violations and a fake equation, and critiques Gemini’s flawed logic.
  • After feeding Gemma’s arguments back into Deepthink, the user claims Deepthink retracts/“folds,” admitting its internal verification failed and its logic was broken.
  • The post argues that smaller models can perform effective agentic peer-review and verification, sometimes outperforming or correcting larger frontier models.
  • Overall, the anecdote suggests an early trend that model size alone isn’t a guarantee of correctness and that smaller models with strong reasoning/tools can be surprisingly capable in adversarial evaluation.

I am still processing this lol.

I had Gemini 3 Pro Deepthink try to solve a complex security puzzle (which was secretly an unwinnable paradox). It spit out this incredibly professional-looking, highly structured answer after about 15 minutes of reasoning. Just for fun, I passed its solution over to Gemma 4 (31B) (with tools enabled).

Gemma completely tore it apart. It caught a hard physical constraint violation and a fake math equation that Gemini tried to sneak by me to force the answer. It explicitly called out the fatal logic flaw and told Gemini it was "blinded by the professionalism of the output." Brutal.

The craziest part? I fed the 31B's arguments back to Deepthink... and it immediately folded, acknowledging that its internal verification failed and its logic was broken.
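For anyone who wants to reproduce the setup, the workflow is simple: one model proposes a solution, a second model critiques it, and the critique gets fed back to the first. Here's a minimal sketch of that loop; the model calls are stubbed with hypothetical placeholders, so swap in your own API or local-inference calls (none of these function names come from the original post):

```python
# Minimal sketch of the cross-model peer-review loop described above.
# solver_model / reviewer_model are hypothetical stand-ins; in practice
# each would call an actual model (frontier API, local 31B, etc.).

def solver_model(task: str) -> str:
    # Stand-in for the large "solver" model.
    return f"Proposed solution for: {task}"

def reviewer_model(solution: str) -> str:
    # Stand-in for the smaller "reviewer" model with tools enabled.
    return f"Critique of: {solution}"

def peer_review_loop(task: str, rounds: int = 2) -> list[str]:
    """Solver proposes, reviewer critiques, critique is fed back to the solver."""
    transcript = []
    solution = solver_model(task)
    transcript.append(solution)
    for _ in range(rounds):
        critique = reviewer_model(solution)
        transcript.append(critique)
        # Feed the reviewer's objections back into the solver's context.
        solution = solver_model(f"{task}\nReviewer objections: {critique}")
        transcript.append(solution)
    return transcript

transcript = peer_review_loop("security puzzle")
```

The interesting behavior in the post (Deepthink "folding") would show up in the second solver turn, once the reviewer's objections land in its context.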

I've attached the HTML log so you guys can read the whole debate. The fact that a 31B open-weight model can perform an agentic peer-review and bully a frontier MoE model into submission is insane to me. Check out the file.

Full conversation

TIL: A bigger model isn't smarter... well, at least not all the time.

submitted by /u/Numerous-Campaign844