EmergeNav: Structured Embodied Inference for Zero-Shot Vision-and-Language Navigation in Continuous Environments
arXiv cs.CV / 3/19/2026
Key Points
- EmergeNav is a zero-shot framework for continuous vision-and-language navigation (VLN-CE) that uses structured embodied inference instead of relying on task-specific training or explicit maps.
- The model introduces a Plan–Solve–Transition hierarchy for stage-structured execution, GIPE for goal-conditioned perceptual extraction, contrastive dual-memory reasoning for progress grounding, and Dual-FOV sensing for time-aligned local control and boundary verification.
- It achieves strong zero-shot performance on VLN-CE, reporting a 30.00% success rate (SR) with a Qwen3-VL-8B backbone and 37.00% SR with Qwen3-VL-32B, using only open-source VLMs and no task-specific training.
- The results suggest that explicit execution structure is a key ingredient for turning vision-language model priors into stable embodied navigation behavior, without relying on explicit maps, graph search, or waypoint predictors.
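To make the stage-structured execution idea concrete, here is a minimal sketch of what a Plan–Solve–Transition loop could look like: the instruction is decomposed into ordered subgoals (Plan), each observation is mapped to a local action toward the current subgoal (Solve), and subgoal completion is verified before advancing (Transition). All function names and the stubbed VLM calls are hypothetical illustrations, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Subgoal:
    description: str
    done: bool = False

def plan(instruction: str) -> list[Subgoal]:
    """Stub for a VLM call that splits an instruction into ordered subgoals."""
    return [Subgoal(part.strip()) for part in instruction.split(",")]

def solve(subgoal: Subgoal, observation: str) -> str:
    """Stub for a VLM call mapping (subgoal, observation) to a local action."""
    return "move_forward"  # placeholder low-level action

def transition(subgoal: Subgoal, observation: str) -> bool:
    """Stub boundary check: has the current subgoal been completed?"""
    return subgoal.description in observation

def navigate(instruction: str, observations: list[str]) -> list[str]:
    """Stage-structured loop: act on the current subgoal until its
    completion is verified, then advance to the next stage."""
    subgoals = plan(instruction)
    actions, i = [], 0
    for obs in observations:
        if i >= len(subgoals):
            break
        if transition(subgoals[i], obs):  # boundary verified: advance stage
            subgoals[i].done = True
            i += 1
            continue
        actions.append(solve(subgoals[i], obs))
    return actions
```

The point of the structure is that progress is gated by an explicit verification step rather than left implicit in a single end-to-end policy, which is the kind of execution discipline the paper argues stabilizes VLM-driven navigation.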