From Pixels to BFS: High Maze Accuracy Does Not Imply Visual Planning

arXiv cs.LG / 3/31/2026



Key Points

The paper introduces MazeBench, a benchmark of 110 procedurally generated maze images designed to test whether multimodal models perform genuine visual planning or instead rely on token-space brute-force search.
Experiments across 16 model configurations show strong leaderboard accuracy for models like GPT-5.4 (91%) and Gemini 3.1 Pro (79%), but the authors argue these results are misleading because models often enumerate paths step-by-step using large numbers of tokens.
Without added reasoning budgets, every configuration scores only 2–12%, and on 20×20 ultra-hard mazes models hit token limits and fail, which indicates limited robustness.
Qualitative analysis finds a consistent two-stage approach: translating the image into a text grid and then running BFS-like path search in prose; a grid ablation further demonstrates that weak visual extraction, not downstream search per se, drives much of the failure.
The authors conclude that high accuracy on visual spatial tasks does not necessarily imply human-like spatial understanding or planning.
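
To make the second stage of that two-stage strategy concrete, here is a minimal Python sketch of breadth-first search over a text grid. The grid encoding (`S` for start, `E` for exit, `#` for wall, `.` for open cell) and the names `bfs_path` and `MAZE` are assumptions made for illustration; the summary does not say how the models or the benchmark actually encode their grids.

```python
from collections import deque

# Hypothetical encoding: 'S' start, 'E' exit, '#' wall, '.' open cell.
MAZE = [
    "S.#",
    ".##",
    "..E",
]

def bfs_path(grid):
    """Return a shortest list of (row, col) cells from 'S' to 'E', or None."""
    start = goal = None
    for r, row in enumerate(grid):
        for c, ch in enumerate(row):
            if ch == "S":
                start = (r, c)
            elif ch == "E":
                goal = (r, c)
    if start is None or goal is None:
        raise ValueError("grid needs both an 'S' and an 'E' cell")

    parent = {start: None}          # doubles as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < len(grid) and 0 <= nc < len(grid[nr])
            if inside and grid[nr][nc] != "#" and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None                     # exit unreachable

print(bfs_path(MAZE))
```

A program does this in a few dozen lines. The paper's point is that models appear to carry out the same expansion one step at a time in generated text, which is where the token cost comes from.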

Abstract

How do multimodal models solve visual spatial tasks: through genuine planning, or through brute-force search in token space? We introduce MazeBench, a benchmark of 110 procedurally generated maze images across nine controlled groups, and evaluate 16 model configurations from OpenAI, Anthropic, Google, and Alibaba. GPT-5.4 solves 91% and Gemini 3.1 Pro 79%, but these scores are misleading: models typically translate images into text grids and then enumerate paths step by step, consuming 1,710–22,818 tokens per solve for a task humans do quickly. Without added reasoning budgets, all configurations score only 2–12%; on 20×20 ultra-hard mazes, they hit token limits and fail. Qualitative traces reveal a common two-stage strategy: image-to-grid translation followed by token-level search, effectively BFS in prose. A text-grid ablation shows Claude Sonnet 4.6 rising from 6% on images to 80% when given the correct grid, isolating weak visual extraction from downstream search. When explicitly instructed not to construct a grid or perform graph search, models still revert to the same enumeration strategy. MazeBench therefore shows that high accuracy on visual planning tasks does not imply human-like spatial understanding.
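
Solve rates like those quoted above presuppose some way of checking a model's answer against the maze. The abstract does not describe MazeBench's scoring procedure, so the following Python sketch is only one plausible checker under stated assumptions: the grid uses `S`, `E`, `#`, and `.` as hypothetical symbols, the answer is a string of `U`/`D`/`L`/`R` moves, and `is_valid_solution` is an illustrative name rather than part of the benchmark.

```python
# Hypothetical move alphabet: up, down, left, right as (row, col) deltas.
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def is_valid_solution(grid, moves):
    """Return True if `moves` walks from 'S' to 'E' without leaving the
    grid or entering a wall ('#'). Grid symbols are assumed, not sourced."""
    start = None
    for r, row in enumerate(grid):
        c = row.find("S")
        if c != -1:
            start = (r, c)
            break
    if start is None:
        raise ValueError("grid has no 'S' cell")

    r, c = start
    for m in moves:
        if m not in MOVES:
            return False            # unknown move symbol
        dr, dc = MOVES[m]
        r, c = r + dr, c + dc
        if not (0 <= r < len(grid) and 0 <= c < len(grid[r])):
            return False            # stepped off the grid
        if grid[r][c] == "#":
            return False            # walked into a wall
    return grid[r][c] == "E"        # must finish on the exit

grid = ["S.#", ".##", "..E"]
print(is_valid_solution(grid, "DDRR"))   # path along the left and bottom edges
print(is_valid_solution(grid, "RR"))     # second step hits a wall
```

A checker of this kind only tells you whether the final path is correct. It cannot distinguish a model that planned visually from one that enumerated paths in text, which is why the paper turns to token counts and qualitative traces.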

