Do LLMs Build Spatial World Models? Evidence from Grid-World Maze Tasks

arXiv cs.AI / 4/14/2026

💬 Opinion · Models & Research

Key Points

  • The study tests whether LLMs can build internal spatial world models by using controlled grid-world maze tasks that require multi-step planning and spatial abstraction.
  • Results across Gemini-2.5-Flash, GPT-5-mini, Claude-Haiku-4.5, and DeepSeek-Chat show large failures in spatial reasoning, with performance dropping sharply when switching from tokenized adjacency representations (80–86% on small grids) to visual grid formats (16–34%).
  • Follow-up probes using sequential proximity and compositional distance questions find that high semantic coverage in reasoning traces (96–99%) does not translate into reliable spatial computations, implying the models do not accumulate spatial knowledge.
  • The authors conclude that LLM spatial reasoning is representation- and prompting-dependent, succeeding only in narrow conditions rather than forming robust, format-invariant spatial world models.
  • The findings raise concerns for deploying foundation models in applications that rely on consistent spatial abstraction for planning and reasoning.

Abstract

Foundation models have shown remarkable performance across diverse tasks, yet their ability to construct internal spatial world models for reasoning and planning remains unclear. We systematically evaluate the spatial understanding of large language models through maze tasks, a controlled setting that requires multi-step planning and spatial abstraction. Across comprehensive experiments with Gemini-2.5-Flash, GPT-5-mini, Claude-Haiku-4.5, and DeepSeek-Chat, we uncover significant discrepancies in spatial reasoning that challenge assumptions about LLM planning capabilities. With chain-of-thought prompting, Gemini achieves 80-86% accuracy on smaller mazes (5x5 to 7x7 grids) with tokenized adjacency representations, but performance collapses to 16-34% with visual grid formats, a 2-5x drop that suggests representation-dependent rather than format-invariant spatial reasoning. We further probe spatial understanding through sequential proximity questions and compositional distance comparisons. Despite achieving 96-99% semantic coverage in reasoning traces, models fail to leverage this understanding for consistent spatial computations, indicating that they treat each question independently rather than building cumulative spatial knowledge. Our findings from the maze-solving tasks suggest that LLMs do not develop robust spatial world models, but instead exhibit representation-specific, prompting-dependent reasoning that succeeds only under narrow conditions. These results have critical implications for deploying foundation models in applications requiring spatial abstraction.
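To make the representation gap concrete, the sketch below shows how one and the same small maze can be encoded either as a tokenized adjacency list or as a visual grid, alongside a BFS solver that yields ground-truth shortest paths for scoring model answers. This is a hypothetical illustration, not the paper's actual prompt formats: the 3x3 maze layout, the `<->` edge notation, and all function names are invented here.

```python
from collections import deque

# Hypothetical 3x3 maze; cells are (row, col) and each pair is an open
# passage. Edges are stored smaller-coordinate-first, which grid_prompt()
# relies on when checking for walls.
EDGES = {
    ((0, 0), (0, 1)), ((0, 1), (0, 2)),   # top row open
    ((0, 0), (1, 0)), ((1, 0), (2, 0)),   # left column open
    ((0, 2), (1, 2)), ((1, 2), (2, 2)),   # right column open
    ((2, 0), (2, 1)), ((1, 1), (2, 1)),   # bottom passage; (1,1) reached via (2,1)
}

def neighbors(cell):
    """Cells reachable from `cell` through an open passage."""
    for a, b in EDGES:
        if a == cell:
            yield b
        elif b == cell:
            yield a

def adjacency_prompt():
    """Tokenized adjacency representation: one open passage per line."""
    return "\n".join(f"{a} <-> {b}" for a, b in sorted(EDGES))

def grid_prompt(n=3):
    """Visual grid representation: '|' and '-' mark walls between cells."""
    lines = []
    for r in range(n):
        row = ""
        for c in range(n):
            row += "."
            row += " " if ((r, c), (r, c + 1)) in EDGES else "|"
        lines.append(row.rstrip("|"))
        if r < n - 1:
            sep = ""
            for c in range(n):
                sep += " " if ((r, c), (r + 1, c)) in EDGES else "-"
                sep += " "
            lines.append(sep.rstrip())
    return "\n".join(lines)

def shortest_path(start, goal):
    """BFS ground truth against which a model's route can be scored."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbors(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # goal unreachable
```

Feeding `adjacency_prompt()` versus `grid_prompt()` to a model, then comparing its answer against `shortest_path(start, goal)`, is one way to reproduce the kind of format-sensitivity comparison the paper reports: the underlying maze is identical, so any accuracy gap is attributable to the representation alone.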