Can Vision Foundation Models Navigate? Zero-Shot Real-World Evaluation and Lessons Learned

arXiv cs.LG / 3/30/2026


Key Points

  • The paper presents a real-world, zero-shot evaluation of five state-of-the-art Visual Navigation Models (GNM, ViNT, NoMaD, NaviBridger, and CrossFormer) across two robot platforms and five indoor and outdoor environments, moving beyond success-rate-only evaluation.
  • It introduces richer assessment beyond reaching the goal, including path-based metrics, vision-based goal-recognition scores, and robustness tests using controlled image perturbations such as motion blur and sunflare.
  • The analysis finds recurring weaknesses: frequent collisions suggesting limited geometric understanding, difficulty distinguishing visually similar locations leading to goal prediction errors, and performance drops under distribution shift.
  • The authors plan to publicly release the evaluation codebase and dataset to support reproducible benchmarking of vision navigation models.
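The robustness tests above apply controlled image perturbations such as motion blur before feeding frames to the navigation model. As a minimal sketch of what such a perturbation might look like (the function name and kernel parameters are illustrative assumptions, not the paper's released code):

```python
import numpy as np

def motion_blur(image: np.ndarray, kernel_size: int = 9) -> np.ndarray:
    """Horizontal motion blur: average each pixel over a sliding window
    along the width axis. `image` is H x W x C; edges are replicated.
    Hypothetical helper, not the paper's actual perturbation pipeline."""
    pad = kernel_size // 2
    padded = np.pad(image, ((0, 0), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros(image.shape, dtype=np.float64)
    # sum the kernel_size shifted copies, then normalize
    for k in range(kernel_size):
        out += padded[:, k:k + image.shape[1], :]
    return (out / kernel_size).astype(image.dtype)
```

Sweeping `kernel_size` gives a controlled severity axis, so degradation in success or goal-recognition scores can be plotted against perturbation strength.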

Abstract

Visual Navigation Models (VNMs) promise generalizable robot navigation by learning from large-scale visual demonstrations. Despite growing real-world deployment, existing evaluations rely almost exclusively on success rate (whether the robot reaches its goal), which conceals trajectory quality, collision behavior, and robustness to environmental change. We present a real-world evaluation of five state-of-the-art VNMs (GNM, ViNT, NoMaD, NaviBridger, and CrossFormer) across two robot platforms and five environments spanning indoor and outdoor settings. Beyond success rate, we combine path-based metrics with vision-based goal-recognition scores and assess robustness through controlled image perturbations (motion blur, sunflare). Our analysis uncovers three systematic limitations: (a) even architecturally sophisticated diffusion- and transformer-based models exhibit frequent collisions, indicating limited geometric understanding; (b) models fail to discriminate between perceptually similar locations even when semantic differences are present, causing goal-prediction errors in repetitive environments; and (c) performance degrades under distribution shift. We will publicly release our evaluation codebase and dataset to facilitate reproducible benchmarking of VNMs.
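The abstract contrasts bare success rate with path-based metrics. One widely used example of the latter is Success weighted by Path Length (SPL, Anderson et al., 2018), which penalizes successful runs that take inefficient routes; whether the paper uses SPL specifically is an assumption here, and the helper below is an illustrative sketch rather than the authors' evaluation code:

```python
def spl(successes, shortest_lengths, actual_lengths):
    """Success weighted by Path Length over a set of episodes.

    successes        -- 1 if the episode reached the goal, else 0
    shortest_lengths -- geodesic shortest-path length per episode
    actual_lengths   -- length of the path the robot actually drove

    Hypothetical helper illustrating one common path-based metric.
    """
    scores = [
        s * l / max(p, l)  # efficiency term is 1 for an optimal path
        for s, l, p in zip(successes, shortest_lengths, actual_lengths)
    ]
    return sum(scores) / len(scores)
```

A robot that always succeeds but drives twice the shortest distance scores 0.5 rather than 1.0, which is exactly the trajectory-quality signal a plain success rate conceals.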