Before We Trust Them: Decision-Making Failures of Foundation Models in Navigation
arXiv cs.RO / 3/30/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper argues that high benchmark success on navigation-like tasks can mask unreliable decision making in foundation models, motivating failure-focused evaluation rather than relying on aggregate accuracy.
- It evaluates multiple current foundation models on six diagnostic navigation-reasoning tasks, covering both complete and incomplete spatial information as well as safety-relevant decisions, and finds that major decision-making failures persist even when overall performance is strong.
- In a path-planning scenario with unknown cells, GPT-5’s reported 93% success rate still included invalid paths, illustrating that remaining errors can be safety-critical rather than negligible.
- The study finds that newer models are not necessarily more reliable: Gemini-2.5 Flash scored 67% on an emergency-evacuation task on which Gemini-2.0 Flash reached 100% under the same conditions.
- Across evaluations, the models show recurring failure modes including structural collapse, hallucinated reasoning, constraint violations, and unsafe decisions, indicating that these systems cannot be trusted for navigation without fine-grained testing.
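The "invalid paths" finding above hinges on checking a proposed route against the map rather than trusting the model's own report of success. A minimal sketch of such a check is below; the cell labels, function name, and 4-connected movement rule are assumptions for illustration, not the paper's actual evaluation harness.

```python
# Hypothetical path-validity check for a grid with unknown cells.
# A path counts as valid only if it starts and ends correctly, moves one
# 4-connected step at a time, stays in bounds, and never enters an
# obstacle ('#') or unknown ('?') cell.
from typing import List, Tuple

FREE, OBSTACLE, UNKNOWN = ".", "#", "?"

def is_valid_path(grid: List[str],
                  path: List[Tuple[int, int]],
                  start: Tuple[int, int],
                  goal: Tuple[int, int]) -> bool:
    if not path or path[0] != start or path[-1] != goal:
        return False
    rows, cols = len(grid), len(grid[0])
    # Every visited cell must be in bounds and known-free.
    for r, c in path:
        if not (0 <= r < rows and 0 <= c < cols) or grid[r][c] != FREE:
            return False
    # Consecutive cells must differ by exactly one grid step.
    for (r1, c1), (r2, c2) in zip(path, path[1:]):
        if abs(r1 - r2) + abs(c1 - c2) != 1:
            return False
    return True
```

A validator like this is what separates "93% success" from "93% success plus safety-critical failures": a path that cuts through an unknown cell fails the check even if it reaches the goal.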