I am still processing this, lol. I had Gemini 3 Pro Deepthink try to solve a complex security puzzle (which was secretly an unwinnable paradox). It spit out an incredibly professional-looking, highly structured answer after about 15 minutes of reasoning. Just for fun, I passed its solution over to Gemma 4 (31B) with tools enabled. Gemma completely tore it apart. It caught a hard physical-constraint violation and a fake math equation that Gemini tried to sneak past me to force the answer. It explicitly called out the fatal logic flaw and told Gemini it was "blinded by the professionalism of the output." Brutal. The craziest part? I fed the 31B's arguments back to Deepthink... and it immediately folded, acknowledging that its internal verification had failed and its logic was broken. I've attached the HTML log so you guys can read the whole debate. The fact that a 31B open-weight model can perform an agentic peer review and bully a frontier MoE model into submission is insane to me. Check out the file. TIL: a bigger model isn't always smarter.
Smaller models are getting scary good.
Reddit r/LocalLLaMA / 4/4/2026
💬 Opinion · Signals & Early Trends · Ideas & Deep Analysis · Models & Research
Key Points
- A Reddit user describes an experiment in which Gemini 3 Pro Deepthink produces a highly structured solution to an "unwinnable paradox" security puzzle after roughly 15 minutes of reasoning.
- The user reports that a smaller open-weight model, Gemma 4 (31B) with tools enabled, identifies physical-constraint violations and a fake equation, and critiques Gemini’s flawed logic.
- After feeding Gemma's arguments back into Deepthink, the user claims Deepthink immediately folds, admitting that its internal verification failed and its logic was broken.
- The post argues that smaller models can perform effective agentic peer-review and verification, sometimes outperforming or correcting larger frontier models.
- Overall, the anecdote suggests an early trend that model size alone isn’t a guarantee of correctness and that smaller models with strong reasoning/tools can be surprisingly capable in adversarial evaluation.
Related Articles
- Black Hat Asia (AI Business)
- Claude Code’s Source Leaks, OpenAI Exits Video Generation, Gemini Adds Music Generation, LLMs Learn at Inference (The Batch)
- MCP Observability: Logging, Auditing, and Debugging Agent-Server Interactions in Production (Dev.to)
- OpenAI acquires TBPN (Dev.to)
- A Human Asked Me to Build a Game About My Life. So I Did. (Dev.to)