Progress-Think: Semantic Progress Reasoning for Vision-Language Navigation

arXiv cs.RO · April 15, 2026


Key Points

  • The paper proposes Progress-Think, a method for Vision-Language Navigation that models “semantic progress” over long-horizon, multi-step instructions rather than only local visual context or direct action prediction.
  • It argues that existing approaches miss the monotonic co-progression property between observation history and instruction prefixes, motivating progress reasoning derived from visual observations.
  • Progress-Think uses a three-stage training framework: Self-Aligned Progress Pretraining with differentiable alignment, Progress-Guided Policy Pretraining that injects learned progress states into navigation context, and Progress-Policy Co-Finetuning with progress-aware reinforcement objectives.
  • Experiments on R2R-CE and RxR-CE report state-of-the-art navigation success and efficiency, suggesting that semantic progress yields a more consistent representation of navigation advancement.

Abstract

Vision-Language Navigation requires agents to act coherently over long horizons by understanding not only local visual context but also how far they have advanced within a multi-step instruction. However, recent Vision-Language-Action models focus on direct action prediction and earlier progress methods predict numeric achievements; both overlook the monotonic co-progression property of the observation and instruction sequences. Building on this insight, Progress-Think introduces semantic progress reasoning, predicting instruction-style progress from visual observations to enable more accurate navigation. To achieve this without expensive annotations, we propose a three-stage framework. In the initial stage, Self-Aligned Progress Pretraining bootstraps a reasoning module via a novel differentiable alignment between visual history and instruction prefixes. Then, Progress-Guided Policy Pretraining injects learned progress states into the navigation context, guiding the policy toward consistent actions. Finally, Progress-Policy Co-Finetuning jointly optimizes both modules with tailored progress-aware reinforcement objectives. Experiments on R2R-CE and RxR-CE show state-of-the-art success and efficiency, demonstrating that semantic progress yields a more consistent representation of navigation advancement.
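To make the co-progression idea concrete: as the agent accumulates visual observations, its estimated position within the instruction should only move forward. A minimal NumPy sketch of this intuition is below; it is not the paper's implementation, and all names (`soft_progress`, `monotonicity_penalty`) and the dot-product similarity are illustrative assumptions. It softly aligns each history step to instruction prefixes and penalizes any backward drift in the estimated progress.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_progress(history, prefixes):
    """Soft progress estimate per history step (hypothetical sketch).

    history:  (T, d) visual-history embeddings
    prefixes: (K, d) instruction-prefix embeddings
    Returns a (T,) vector: the expected prefix index at each step,
    i.e. a differentiable scalar measure of instruction progress.
    """
    sim = history @ prefixes.T          # (T, K) similarity scores
    attn = softmax(sim, axis=-1)        # soft alignment over prefixes
    idx = np.arange(prefixes.shape[0])  # prefix positions 0..K-1
    return attn @ idx                   # expected position per step

def monotonicity_penalty(progress):
    # Co-progression says progress should be non-decreasing over time;
    # penalize every step where the estimate moves backward.
    diffs = np.diff(progress)
    return np.maximum(-diffs, 0.0).sum()

rng = np.random.default_rng(0)
hist = rng.normal(size=(5, 8))   # 5 history steps, 8-dim embeddings
pref = rng.normal(size=(4, 8))   # 4 instruction prefixes
p = soft_progress(hist, pref)
loss = monotonicity_penalty(p)
```

In a trained system the embeddings would come from the navigation model and the penalty would be one term in a learning objective; here random embeddings simply demonstrate the shapes and the forward computation.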