Three-Step Nav: A Hierarchical Global-Local Planner for Zero-Shot Vision-and-Language Navigation
arXiv cs.CV / April 30, 2026
Key Points
- The paper introduces “Three-Step Nav,” a hierarchical global–local planning method for zero-shot vision-and-language navigation using multimodal LLMs.
- It addresses common MLLM-VLN failure modes—drifting off course and stopping too early—with a three-view protocol: look forward (extract global landmarks and form a coarse plan), look now (align the current observation to the next sub-goal), and look backward (audit the full trajectory for drift before committing to a stop).
- The approach requires no gradient updates and no task-specific fine-tuning, enabling it to plug into existing VLN pipelines with minimal overhead.
- Three-Step Nav reportedly achieves state-of-the-art zero-shot results on the R2R-CE and RxR-CE benchmarks, and the authors release code on GitHub.
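The three-view protocol above can be sketched as a simple planning loop. Everything below is an illustrative assumption—the class and function names (`NavState`, `look_forward`, `look_now`, `look_backward`, `StubMLLM`) are hypothetical and do not come from the paper or its released code; the stub stands in for the multimodal LLM calls.

```python
from dataclasses import dataclass, field

@dataclass
class NavState:
    instruction: str
    subgoals: list = field(default_factory=list)    # filled by "look forward"
    next_idx: int = 0                               # pointer to the next sub-goal
    trajectory: list = field(default_factory=list)  # observations seen so far

class StubMLLM:
    """Toy stand-in for a multimodal LLM, for illustration only."""
    def extract_landmarks(self, instr):
        # Pretend-parse the instruction into landmark sub-goals.
        return [w for w in instr.split() if w in {"stairs", "kitchen"}]
    def choose_action(self, obs, target):
        return "STOP" if obs == target else "FORWARD"
    def subgoal_reached(self, obs, target):
        return obs == target
    def trajectory_matches(self, traj, instr):
        # Audit: every landmark from the instruction appears in the trajectory.
        return all(g in traj for g in self.extract_landmarks(instr))

def look_forward(state, mllm):
    """Global pass: parse the instruction into coarse landmark sub-goals."""
    state.subgoals = mllm.extract_landmarks(state.instruction)

def look_now(state, observation, mllm):
    """Local pass: align the current view to the next sub-goal, pick an action."""
    target = state.subgoals[state.next_idx]
    action = mllm.choose_action(observation, target)
    state.trajectory.append(observation)
    if mllm.subgoal_reached(observation, target):
        state.next_idx += 1
    return action

def look_backward(state, mllm):
    """Audit pass: verify the whole trajectory against the instruction
    before allowing a STOP, catching drift or premature termination."""
    return mllm.trajectory_matches(state.trajectory, state.instruction)
```

Because every step is an inference-time prompt to the MLLM (here mocked by `StubMLLM`), no gradients or fine-tuning are involved, which matches the paper's claim that the method drops into existing VLN pipelines.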