Policy-Guided World Model Planning for Language-Conditioned Visual Navigation

arXiv cs.AI / 3/30/2026


Key Points

  • The paper tackles instruction-conditioned visual navigation by addressing two common weaknesses: long-horizon planning limits in reactive policies and poor action initialization in world-model planners within high-dimensional spaces.
  • It proposes PiJEPA, a two-stage framework that first fine-tunes an Octo-based language-conditioned navigation policy on the CAST navigation dataset, with a frozen vision encoder (DINOv2 or V-JEPA-2), to produce an informed action distribution.
  • In the second stage, PiJEPA warm-starts Model Predictive Path Integral (MPPI) planning from the policy-derived distribution rather than an uninformed Gaussian, enabling faster convergence to high-quality action sequences.
  • The approach uses a separately trained JEPA world model to predict future latent states in the same vision-encoder embedding space, enabling latent-space planning consistent with the perception module.
  • Experiments on real-world navigation tasks show PiJEPA outperforms both standalone policy execution and uninformed world-model planning, with systematic comparisons of DINOv2 vs V-JEPA-2 backbones across policy and world-model components.
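The warm-start idea in the second bullet can be sketched as follows. This is an illustrative toy implementation of MPPI with a policy prior, not the paper's code: the function names, the fixed sampling std, and the iteration scheme are assumptions for clarity.

```python
import numpy as np

def mppi_plan(policy_mean, policy_std, rollout_cost, horizon=8, act_dim=2,
              n_samples=256, n_iters=3, temperature=1.0, rng=None):
    """Plan an action sequence with MPPI, warm-started from a policy prior.

    policy_mean, policy_std: (horizon, act_dim) arrays from the learned
        policy -- the informed initialization (vs. an uninformed Gaussian,
        which would set policy_mean to zeros and policy_std to a wide value).
    rollout_cost: function mapping a batch of action sequences
        (n_samples, horizon, act_dim) to costs (n_samples,), e.g. distance
        to the goal embedding under a latent world model.
    """
    rng = rng or np.random.default_rng(0)
    mean, std = policy_mean.copy(), policy_std.copy()
    for _ in range(n_iters):
        noise = rng.standard_normal((n_samples, horizon, act_dim))
        actions = mean + std * noise           # sample around current mean
        costs = rollout_cost(actions)          # evaluate each sequence
        w = np.exp(-(costs - costs.min()) / temperature)
        w /= w.sum()                           # softmax over negative cost
        mean = np.einsum("s,sha->ha", w, actions)  # cost-weighted average
    return mean
```

Starting `mean` at the policy's output concentrates samples near plausible behavior from the first iteration, which is the claimed source of faster convergence.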

Abstract

Navigating to a visually specified goal given natural language instructions remains a fundamental challenge in embodied AI. Existing approaches either rely on reactive policies that struggle with long-horizon planning, or employ world models that suffer from poor action initialization in high-dimensional spaces. We present PiJEPA, a two-stage framework that combines the strengths of learned navigation policies with latent world model planning for instruction-conditioned visual navigation. In the first stage, we finetune an Octo-based generalist policy, augmented with a frozen pretrained vision encoder (DINOv2 or V-JEPA-2), on the CAST navigation dataset to produce an informed action distribution conditioned on the current observation and language instruction. In the second stage, we use this policy-derived distribution to warm-start Model Predictive Path Integral (MPPI) planning over a separately trained JEPA world model, which predicts future latent states in the embedding space of the same frozen encoder. By initializing the MPPI sampling distribution from the policy prior rather than from an uninformed Gaussian, our planner converges faster to high-quality action sequences that reach the goal. We systematically study the effect of the vision encoder backbone, comparing DINOv2 and V-JEPA-2, across both the policy and world model components. Experiments on real-world navigation tasks demonstrate that PiJEPA significantly outperforms both standalone policy execution and uninformed world model planning, achieving improved goal-reaching accuracy and instruction-following fidelity.
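The latent-space planning loop described in the abstract, where the world model predicts future states in the frozen encoder's embedding space, can be sketched minimally. The interface below (a `world_model_step` function and a final-state goal distance) is an assumption for illustration, not the paper's implementation.

```python
import numpy as np

def latent_rollout_cost(actions, z0, z_goal, world_model_step):
    """Score action sequences by rolling them out in latent space.

    actions: (n, horizon, act_dim) candidate action sequences.
    z0, z_goal: (d,) current and goal latents from the same frozen encoder.
    world_model_step: fn ((n, d), (n, act_dim)) -> (n, d), predicting the
        next latent state (the JEPA world model's one-step transition).
    Returns (n,) costs: distance of the final predicted latent to the goal.
    """
    n, horizon, _ = actions.shape
    z = np.tile(z0, (n, 1))                  # broadcast start latent
    for t in range(horizon):
        z = world_model_step(z, actions[:, t])  # predict next latent
    return np.linalg.norm(z - z_goal, axis=1)
```

Because both the rollout and the goal live in the encoder's embedding space, the planner's cost is consistent with the perception module, which is the design point the abstract emphasizes.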