EmergeNav: Structured Embodied Inference for Zero-Shot Vision-and-Language Navigation in Continuous Environments
arXiv cs.CV · March 19, 2026
Key Points
- EmergeNav is a zero-shot framework for continuous vision-and-language navigation (VLN-CE) that uses structured embodied inference instead of relying on task-specific training or explicit maps.
- The model introduces a Plan–Solve–Transition hierarchy for stage-structured execution, GIPE for goal-conditioned perceptual extraction, contrastive dual-memory reasoning for progress grounding, and Dual-FOV sensing for time-aligned local control and boundary verification.
- It achieves strong zero-shot performance on VLN-CE, reporting a 30.00 success rate (SR) with Qwen3-VL-8B and a 37.00 SR with Qwen3-VL-32B, using only open-source VLM backbones and no task-specific training.
- The results suggest that explicit execution structure is a key ingredient for turning vision-language model priors into stable embodied navigation behavior, without relying on explicit maps, graph search, or waypoint predictors.
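The Plan–Solve–Transition hierarchy described above can be sketched as a minimal stage-structured execution loop. This is a hypothetical illustration only: the paper's actual interfaces are not given in this summary, and every name here (`plan`, `solve`, `transition`, `Stage`) is an assumption. In a real system the three functions would be backed by VLM calls; the string-matching stand-ins below merely show the control flow.

```python
# Hypothetical sketch of a Plan-Solve-Transition execution loop.
# All names are illustrative stand-ins, not the paper's actual API.

from dataclasses import dataclass

@dataclass
class Stage:
    instruction: str  # sub-goal text for this stage
    done: bool = False

def plan(instruction: str) -> list[Stage]:
    # Stand-in for VLM-based decomposition of the full instruction
    # into an ordered list of sub-goal stages.
    return [Stage(s.strip()) for s in instruction.split(",") if s.strip()]

def solve(stage: Stage, observation: str) -> str:
    # Stand-in for goal-conditioned perception + local control:
    # emit a low-level action toward the current sub-goal.
    return f"move_toward({stage.instruction!r})"

def transition(stage: Stage, observation: str) -> bool:
    # Stand-in for boundary verification: decide whether the
    # current stage's sub-goal appears reached.
    return stage.instruction.split()[-1] in observation

def navigate(instruction: str, observations: list[str]) -> list[str]:
    stages = plan(instruction)
    actions, i = [], 0
    for obs in observations:
        if i >= len(stages):
            break  # all stages complete
        actions.append(solve(stages[i], obs))
        if transition(stages[i], obs):
            stages[i].done = True
            i += 1  # advance to the next stage
    return actions

actions = navigate(
    "go to the hallway, stop at the door",
    ["a long corridor", "the hallway ahead", "the door is here"],
)
```

The point of the structure is that stage advancement is gated by an explicit verification step rather than left implicit in a single end-to-end policy, which is the "explicit execution structure" the summary credits for stable behavior.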