LLMs for Text-Based Exploration and Navigation Under Partial Observability

arXiv cs.AI / 4/14/2026


Key Points

  • The paper studies whether large language models can serve as text-only controllers for exploration and navigation in unknown environments under partial observability, without tools or code execution.
  • It introduces a reproducible benchmark using fixed ASCII gridworlds with oracle localization, where each move reveals only a local 5×5 view and the model must choose among UP/RIGHT/DOWN/LEFT.
  • Across nine LLMs (various architectures and tuning styles), reasoning-tuned models most reliably complete navigation across layouts, though they are typically less efficient than oracle (shortest) paths.
  • Few-shot prompts mainly improve reasoning-tuned models by reducing invalid actions and shortening trajectories, while dense instruction-tuned models show more inconsistency.
  • The authors find that characteristic action priors (e.g., UP/RIGHT) can cause looping under partial observability.
  • Training regimen and test-time deliberation predict control ability better than sheer parameter count, suggesting hybridization with classical online planners for practical systems.
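The observation model the key points describe, a 5×5 local window and four discrete moves, can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code; the grid, the `local_view` and `step` helpers, and the padding convention are all assumptions.

```python
# Hypothetical ASCII gridworld: '#' = wall, '.' = free, 'G' = goal.
GRID = [
    "#########",
    "#.......#",
    "#.###.#.#",
    "#...#.#G#",
    "#########",
]

def local_view(grid, r, c, radius=2):
    """Return the (2*radius+1)-square ASCII window around (r, c).
    Cells outside the map are padded with '#'; the agent is drawn '@'."""
    rows = []
    for dr in range(-radius, radius + 1):
        row = []
        for dc in range(-radius, radius + 1):
            rr, cc = r + dr, c + dc
            if 0 <= rr < len(grid) and 0 <= cc < len(grid[0]):
                row.append("@" if (dr, dc) == (0, 0) else grid[rr][cc])
            else:
                row.append("#")
        rows.append("".join(row))
    return "\n".join(rows)

MOVES = {"UP": (-1, 0), "RIGHT": (0, 1), "DOWN": (1, 0), "LEFT": (0, -1)}

def step(grid, r, c, action):
    """Apply one of UP/RIGHT/DOWN/LEFT. A move into a wall is invalid:
    the agent stays in place and the flag is False."""
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    if grid[nr][nc] == "#":
        return r, c, False
    return nr, nc, True
```

At each turn the controller would see only `local_view(GRID, r, c)` as text and reply with one action string, which is what makes looping under a fixed action prior possible: the model never sees the full map.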

Abstract

Exploration and goal-directed navigation in unknown layouts are central to inspection, logistics, and search-and-rescue. We ask whether large language models (LLMs) can function as text-only controllers under partial observability, without code execution, tools, or program synthesis. We introduce a reproducible benchmark with oracle localisation in fixed ASCII gridworlds: each step reveals only a local 5×5 window around the agent, and the model must select one of UP/RIGHT/DOWN/LEFT. Nine contemporary LLMs, spanning open and proprietary, dense and Mixture-of-Experts, and instruction- versus reasoning-tuned variants, are evaluated on two tasks across three layouts of increasing difficulty: Exploration (maximising revealed cells) and Navigation (reaching the goal on the shortest path). Results are assessed with quantitative metrics, including success rate and efficiency measures such as normalised coverage and path length versus the oracle, as well as qualitative analysis. Reasoning-tuned models reliably complete navigation across all layouts, yet remain less efficient than oracle paths. Few-shot demonstrations in the prompt chiefly help reasoning-tuned models by reducing invalid moves and shortening paths, while classic dense instruction-tuned models remain inconsistent. We observe characteristic action priors (UP/RIGHT) that can induce looping under partial observability. Overall, training regimen and test-time deliberation predict control ability better than raw parameter count. These findings suggest lightweight hybridisation with classical online planners as a practical route to deployable partial-map systems.
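The efficiency metrics named in the abstract are straightforward to state concretely. The sketch below, under assumed definitions (the paper does not publish these formulas here), computes an oracle shortest-path length by BFS on the fully observed grid and a normalised-coverage score for the exploration task; the grid and function names are illustrative.

```python
from collections import deque

# Small example map: '#' = wall, 'S' = start, 'G' = goal.
GRID = [
    "#####",
    "#..G#",
    "#.#.#",
    "#S..#",
    "#####",
]

def oracle_path_length(grid, start, goal):
    """BFS shortest-path length on the fully observed grid -- the
    'oracle' baseline agent path lengths are compared against."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((-1, 0), (0, 1), (1, 0), (0, -1)):
            nxt = (r + dr, c + dc)
            if grid[nxt[0]][nxt[1]] != "#" and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # goal unreachable

def normalised_coverage(revealed, grid):
    """Fraction of free (non-wall) cells contained in the revealed set."""
    free = {(r, c) for r, row in enumerate(grid)
            for c, ch in enumerate(row) if ch != "#"}
    return len(revealed & free) / len(free)
```

An agent that reaches the goal in 10 steps on a map whose `oracle_path_length` is 4 would have a path-length ratio of 2.5, the kind of "succeeds but inefficiently" behaviour the abstract attributes to reasoning-tuned models.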