Before We Trust Them: Decision-Making Failures in Navigation of Foundation Models

arXiv cs.RO · March 30, 2026


Key Points

  • The paper argues that high benchmark success on navigation-like tasks can mask unreliable decision making in foundation models, motivating failure-focused evaluation rather than relying on aggregate accuracy.
  • It evaluates multiple current foundation models on six diagnostic navigation-reasoning tasks spanning three settings — complete spatial information, incomplete spatial information, and safety-relevant information — finding that major decision-making failures persist even when overall performance is strong.
  • In a path-planning scenario with unknown cells, GPT-5’s reported 93% success rate still included invalid paths, illustrating that remaining errors can be safety-critical rather than negligible.
  • The study finds that newer models are not necessarily more reliable: Gemini-2.5 Flash scored only 67% on an emergency-evacuation task, while its predecessor Gemini-2.0 Flash reached 100% under the same conditions.
  • Across evaluations, the models show recurring failure modes including structural collapse, hallucinated reasoning, constraint violations, and unsafe decisions, indicating that these systems cannot be trusted for navigation without fine-grained testing.

Abstract

High success rates on navigation-related tasks do not necessarily translate into reliable decision making by foundation models. To examine this gap, we evaluate current models on six diagnostic tasks spanning three settings: reasoning under complete spatial information, reasoning under incomplete spatial information, and reasoning under safety-relevant information. Our results show that important decision-making failures can persist even when overall performance is strong, underscoring the need for failure-focused analysis to understand model limitations and guide future progress. In a path-planning setting with unknown cells, GPT-5 achieved a high success rate of 93%, yet the remaining cases still included invalid paths. We also find that newer models are not always more reliable than their predecessors. In reasoning under safety-relevant information, Gemini-2.5 Flash achieved only 67% on the challenging emergency-evacuation task, underperforming Gemini-2.0 Flash, which reached 100% under the same condition. Across all evaluations, models exhibited structural collapse, hallucinated reasoning, constraint violations, and unsafe decisions. These findings show that foundation models still exhibit substantial failures in navigation-related decision making and require fine-grained evaluation before they can be trusted. Project page: https://cmubig.github.io/before-we-trust-them/
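The abstract's central example — a 93% success rate that still hides invalid paths through unknown cells — points to the kind of fine-grained check a failure-focused evaluation requires: rather than counting task completions, each proposed path must be validated step by step. The sketch below is a hypothetical illustration of such a validator, not the paper's actual evaluation harness; the grid encoding, the cell labels (`"free"`, `"obstacle"`, `"unknown"`), and the 4-connected movement rule are all assumptions made for illustration.

```python
# Illustrative validity check for a model-proposed grid path.
# Assumption: the grid is a dict mapping (row, col) -> "free",
# "obstacle", or "unknown"; a valid path uses 4-connected unit
# steps, has the required endpoints, and only enters free cells.

def is_valid_path(path, grid, start, goal):
    """Return True only if `path` is a well-formed, safe route."""
    if not path or path[0] != start or path[-1] != goal:
        return False  # structural failure: empty path or wrong endpoints
    for cell in path:
        if grid.get(cell) != "free":
            return False  # enters an obstacle, an unknown cell, or leaves the grid
    for (r1, c1), (r2, c2) in zip(path, path[1:]):
        if abs(r1 - r2) + abs(c1 - c2) != 1:
            return False  # non-unit step: diagonal move or "teleport"
    return True
```

Under this kind of check, aggregate accuracy and path validity come apart exactly as the paper describes: a model can solve most instances while its remaining errors are routes that cross unknown cells or skip over the grid entirely — failures that only per-path validation surfaces.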
