Policy-Guided World Model Planning for Language-Conditioned Visual Navigation
arXiv cs.AI / 3/30/2026
Key Points
- The paper tackles instruction-conditioned visual navigation by addressing two common weaknesses: long-horizon planning limits in reactive policies and poor action initialization in world-model planners within high-dimensional spaces.
- It proposes PiJEPA, a two-stage framework that first fine-tunes an Octo-based language-conditioned navigation policy using CAST, with a frozen vision encoder (DINOv2 or V-JEPA-2) to generate an informed action distribution.
- In the second stage, PiJEPA warm-starts Model Predictive Path Integral (MPPI) planning with the policy-derived distribution rather than an uninformed Gaussian, enabling faster convergence to high-quality action sequences.
- The approach uses a separately trained JEPA world model to predict future latent states in the same vision-encoder embedding space, enabling latent-space planning consistent with the perception module.
- Experiments on real-world navigation tasks show PiJEPA outperforms both standalone policy execution and uninformed world-model planning, with systematic comparisons of DINOv2 vs V-JEPA-2 backbones across policy and world-model components.
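The warm-started MPPI step described above can be illustrated with a minimal sketch: sample action sequences around a policy-proposed mean (rather than around zero), roll each sequence out with a latent-space world model, and combine them with path-integral softmax weighting. The `world_model` and `cost` functions below are hypothetical placeholders, not the paper's learned JEPA dynamics or objective; only the warm-start-plus-weighting structure follows the description.

```python
import numpy as np

rng = np.random.default_rng(0)

def world_model(z, a):
    # Placeholder latent dynamics (the paper uses a learned JEPA world model).
    return z + 0.1 * a

def cost(z, z_goal):
    # Placeholder cost: squared distance to a goal latent.
    return np.sum((z - z_goal) ** 2)

def mppi_warm_start(z0, z_goal, mu_policy, sigma=0.2,
                    horizon=5, n_samples=64, temperature=1.0):
    """MPPI whose sampling distribution is centered on a
    policy-proposed action sequence (the warm start) instead of zeros."""
    d = mu_policy.shape[-1]
    # Perturb the policy mean to get candidate action sequences: (N, H, d).
    eps = rng.normal(0.0, sigma, size=(n_samples, horizon, d))
    actions = mu_policy[None] + eps

    # Roll each candidate out in latent space and accumulate cost.
    costs = np.zeros(n_samples)
    for i in range(n_samples):
        z = z0
        for t in range(horizon):
            z = world_model(z, actions[i, t])
            costs[i] += cost(z, z_goal)

    # Path-integral weighting: softmax over negative (shifted) costs.
    w = np.exp(-(costs - costs.min()) / temperature)
    w /= w.sum()
    return np.tensordot(w, actions, axes=1)  # weighted-mean plan, (H, d)

z0, z_goal = np.zeros(2), np.ones(2)
mu_policy = np.full((5, 2), 0.5)  # hypothetical policy proposal
plan = mppi_warm_start(z0, z_goal, mu_policy)
```

Because the samples already start near the policy's proposal, far fewer rollouts are wasted on implausible action sequences than when sampling from a zero-mean Gaussian.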