Do LLMs Build Spatial World Models? Evidence from Grid-World Maze Tasks

arXiv cs.AI / 4/14/2026

💬 Opinion · Models & Research

Key Points

  • The study tests whether LLMs can build internal spatial world models by using controlled grid-world maze tasks that require multi-step planning and spatial abstraction.
  • Results across Gemini-2.5-Flash, GPT-5-mini, Claude-Haiku-4.5, and DeepSeek-Chat show large failures in spatial reasoning, with performance dropping sharply when switching from tokenized adjacency representations (80–86% on small grids) to visual grid formats (16–34%).
  • Follow-up probes using sequential proximity and compositional distance questions find that high semantic coverage in reasoning traces (96–99%) does not translate into reliable spatial computations, implying the models do not accumulate spatial knowledge.
  • The authors conclude that LLM spatial reasoning is representation- and prompting-dependent, succeeding only in narrow conditions rather than forming robust, format-invariant spatial world models.
  • The findings raise concerns for deploying foundation models in applications that rely on consistent spatial abstraction for planning and reasoning.

Abstract

Foundation models have shown remarkable performance across diverse tasks, yet their ability to construct internal spatial world models for reasoning and planning remains unclear. We systematically evaluate the spatial understanding of large language models through maze tasks, a controlled setting that requires multi-step planning and spatial abstraction. Across comprehensive experiments with Gemini-2.5-Flash, GPT-5-mini, Claude-Haiku-4.5, and DeepSeek-Chat, we uncover significant discrepancies in spatial reasoning that challenge assumptions about LLM planning capabilities. With chain-of-thought prompting, Gemini achieves 80-86% accuracy on smaller mazes (5x5 to 7x7 grids) with tokenized adjacency representations, but performance collapses to 16-34% with visual grid formats, a 2-5x drop that suggests representation-dependent rather than format-invariant spatial reasoning. We further probe spatial understanding through sequential proximity questions and compositional distance comparisons. Despite achieving 96-99% semantic coverage in reasoning traces, models fail to leverage this understanding for consistent spatial computations, indicating that they treat each question independently rather than building cumulative spatial knowledge. Our findings from the maze-solving tasks suggest that LLMs do not develop robust spatial world models, but instead exhibit representation-specific, prompting-dependent reasoning that succeeds only under narrow conditions. These results have critical implications for deploying foundation models in applications requiring spatial abstraction.
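To make the representation gap concrete, the sketch below shows how one and the same small maze can be encoded either as a tokenized adjacency list or as a visual grid, alongside a BFS solver that yields ground-truth shortest paths for scoring model answers. This is a hypothetical illustration, not the paper's actual prompt formats: the 3x3 maze layout, the `<->` edge notation, and all function names are invented here.

```python
from collections import deque

# Hypothetical 3x3 maze; cells are (row, col) and each pair is an open
# passage. Edges are stored smaller-coordinate-first, which grid_prompt()
# relies on when checking for walls.
EDGES = {
    ((0, 0), (0, 1)), ((0, 1), (0, 2)),   # top row open
    ((0, 0), (1, 0)), ((1, 0), (2, 0)),   # left column open
    ((0, 2), (1, 2)), ((1, 2), (2, 2)),   # right column open
    ((2, 0), (2, 1)), ((1, 1), (2, 1)),   # bottom passage; (1,1) reached via (2,1)
}

def neighbors(cell):
    """Cells reachable from `cell` through an open passage."""
    for a, b in EDGES:
        if a == cell:
            yield b
        elif b == cell:
            yield a

def adjacency_prompt():
    """Tokenized adjacency representation: one open passage per line."""
    return "\n".join(f"{a} <-> {b}" for a, b in sorted(EDGES))

def grid_prompt(n=3):
    """Visual grid representation: '|' and '-' mark walls between cells."""
    lines = []
    for r in range(n):
        row = ""
        for c in range(n):
            row += "."
            row += " " if ((r, c), (r, c + 1)) in EDGES else "|"
        lines.append(row.rstrip("|"))
        if r < n - 1:
            sep = ""
            for c in range(n):
                sep += " " if ((r, c), (r + 1, c)) in EDGES else "-"
                sep += " "
            lines.append(sep.rstrip())
    return "\n".join(lines)

def shortest_path(start, goal):
    """BFS ground truth against which a model's route can be scored."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbors(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # goal unreachable
```

Feeding `adjacency_prompt()` versus `grid_prompt()` to a model, then comparing its answer against `shortest_path(start, goal)`, is one way to reproduce the kind of format-sensitivity comparison the paper reports: the underlying maze is identical, so any accuracy gap is attributable to the representation alone.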