AI Navigate

[R] Extreme Sudoku as a constraint-satisfaction benchmark, solved natively without tools or CoT or solution backtracking

Reddit r/MachineLearning / 3/19/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The post discusses the 'Sudoku Extreme' benchmark: a constraint-satisfaction task of ~250,000 hard instances whose solutions are easy to verify. Leading LLMs score near zero on it, while the BDH architecture achieves 97.4% accuracy without chain-of-thought or backtracking.
  • It argues that transformers’ token-by-token continuation with limited internal state makes search-heavy reasoning hard, because maintaining multiple candidate worlds and revising assumptions is difficult without external tools.
  • The piece asks whether language-only reasoning can be pushed further or whether we need architectures with stronger internal memory or latent reasoning spaces to solve such tasks natively.
  • It suggests these results challenge current trends in scaling chain-of-thought and longer contexts, with broad implications for how researchers frame reasoning capabilities in AI.

I came across a writeup from Pathway that I think is more interesting as a reasoning benchmark than as a puzzle result.

They use “Sudoku Extreme”: about 250,000 very hard Sudoku instances. The appeal is that Sudoku here is treated as a pure constraint-satisfaction problem: each solution is trivial to verify, hard to bluff, and the task isn’t naturally linguistic. According to their numbers, leading LLMs (o3‑mini, DeepSeek R1, Claude 3.7 8K) all get 0% accuracy on this benchmark, while their BDH architecture reaches 97.4% accuracy without chain‑of‑thought traces or explicit solution backtracking.
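The “trivial to verify” property is worth spelling out: a completed grid is valid iff every row, column, and 3×3 box contains the digits 1–9 exactly once, which is a handful of set comparisons. A minimal sketch (my own illustration, not code from the post or benchmark; `verify_sudoku` is a hypothetical name):

```python
def verify_sudoku(grid):
    """Check a completed 9x9 grid (list of 9 lists of ints):
    every row, column, and 3x3 box must contain 1-9 exactly once."""
    target = set(range(1, 10))
    rows = grid
    cols = [[grid[r][c] for r in range(9)] for c in range(9)]
    boxes = [[grid[3 * br + r][3 * bc + c]
              for r in range(3) for c in range(3)]
             for br in range(3) for bc in range(3)]
    # Each of the 27 units must be exactly the set {1, ..., 9}.
    return all(set(unit) == target for unit in rows + cols + boxes)
```

This is why the benchmark is hard to bluff: a proposed solution either passes all 27 checks or it doesn’t, with no partial credit for plausible-sounding text.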

What caught my attention is not just the reported result, but the mechanism claim: transformers do token‑by‑token continuation with a relatively limited internal state per step, which is a bad fit for search‑heavy reasoning where you want to keep multiple candidate worlds in play, revise earlier assumptions and converge under tight constraints. Writing a Python solver or calling tools “works,” but that’s a different capability than solving the constraint problem natively.
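To make the contrast concrete: the tool-based route the post sets aside is just explicit depth-first search, a few lines of code that a model can write even when it can’t do the search in its own forward pass. A minimal backtracking sketch, assuming the usual 0-for-empty grid encoding (my own illustration; `solve` is a hypothetical name, not anything from the paper):

```python
def solve(grid):
    """Fill a 9x9 grid in place (0 = empty) by depth-first backtracking:
    find the first empty cell, try each digit consistent with the row,
    column, and 3x3 box, recurse, and undo the guess on failure."""
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                for d in range(1, 10):
                    if (all(grid[r][x] != d for x in range(9))
                            and all(grid[x][c] != d for x in range(9))
                            and all(grid[3 * (r // 3) + i][3 * (c // 3) + j] != d
                                    for i in range(3) for j in range(3))):
                        grid[r][c] = d
                        if solve(grid):
                            return True
                        grid[r][c] = 0  # revise the assumption and retry
                return False  # no digit fits this cell: backtrack
    return True  # no empty cells left: solved
```

The `grid[r][c] = 0` line is exactly the “revise earlier assumptions” step the post argues is awkward for a fixed-state, token-by-token process; in external code it’s one assignment.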

Given how much recent work is about scaling up chain‑of‑thought and longer contexts, I think this raises some uncomfortable questions for transformer‑centric reasoning:

1. If a model can’t handle a large, clean constraint‑satisfaction benchmark without external tools, how far can language‑only reasoning really be pushed?
2. Are we mostly rewarding longer verbalizations of search, instead of building architectures that actually perform search internally?
3. Do we need a different reasoning substrate (e.g., richer latent/continuous reasoning spaces with stronger internal memory) for these tasks, or can transformers realistically get there with enough scaffolding?

Edit: I’ve put the blog link and paper/benchmark details in the comments so it doesn’t clutter the post body.

submitted by /u/THEGAM3CHANG3R