From Pixels to BFS: High Maze Accuracy Does Not Imply Visual Planning

arXiv cs.LG / March 31, 2026


Key Points

  • The paper introduces MazeBench, a benchmark of 110 procedurally generated maze images designed to test whether multimodal models perform genuine visual planning or instead rely on token-space brute-force search.
  • Experiments across 16 model configurations show strong leaderboard accuracy for models like GPT-5.4 (91%) and Gemini 3.1 Pro (79%), but the authors argue these results are misleading because models often enumerate paths step-by-step using large numbers of tokens.
  • When models do not receive additional reasoning budgets, performance collapses to 2–12%, and on 20×20 ultra-hard mazes many models hit token limits and fail, indicating limited robustness.
  • Qualitative analysis finds a consistent two-stage approach: translating the image into a text grid and then running BFS-like path search in prose; a grid ablation further demonstrates that weak visual extraction, not downstream search per se, drives much of the failure.
  • The benchmark concludes that high accuracy on visual spatial tasks does not necessarily imply human-like spatial understanding or planning.
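The "BFS-like path search in prose" that the traces reveal is, in effect, models spelling out an ordinary breadth-first search one token at a time. A minimal sketch of that same procedure in code (on a hypothetical text grid, with `#` as walls, as an illustration rather than the paper's actual evaluation harness):

```python
from collections import deque

def solve_maze(grid, start, goal):
    """Breadth-first search over a text grid.

    grid: list of equal-length strings; '#' cells are walls.
    start, goal: (row, col) tuples.
    Returns the shortest path as a list of cells, or None if unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, [start])])   # frontier of (cell, path-so-far)
    seen = {start}
    while queue:
        (r, c), path = queue.popleft()
        if (r, c) == goal:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != '#' and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), path + [(nr, nc)]))
    return None

# Example 3x3 maze: walls at (0,2) and (1,1).
maze = [
    "..#",
    ".#.",
    "...",
]
path = solve_maze(maze, (0, 0), (2, 2))
# Shortest route hugs the left and bottom edges: 5 cells.
```

Running this enumeration explicitly takes a handful of lines; the paper's point is that models reproduce the same frontier expansion verbatim in natural-language tokens, which is why solves cost thousands of tokens.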

Abstract

How do multimodal models solve visual spatial tasks? Through genuine planning, or through brute-force search in token space? We introduce MazeBench, a benchmark of 110 procedurally generated maze images across nine controlled groups, and evaluate 16 model configurations from OpenAI, Anthropic, Google, and Alibaba. GPT-5.4 solves 91% and Gemini 3.1 Pro 79%, but these scores are misleading: models typically translate images into text grids and then enumerate paths step by step, consuming 1,710–22,818 tokens per solve for a task humans do quickly. Without added reasoning budgets, all configurations score only 2–12%; on 20×20 ultra-hard mazes, they hit token limits and fail. Qualitative traces reveal a common two-stage strategy: image-to-grid translation followed by token-level search, effectively BFS in prose. A text-grid ablation shows Claude Sonnet 4.6 rising from 6% on images to 80% when given the correct grid, isolating weak visual extraction from downstream search. When explicitly instructed not to construct a grid or perform graph search, models still revert to the same enumeration strategy. MazeBench therefore shows that high accuracy on visual planning tasks does not imply human-like spatial understanding.