ProgressVLA: Progress-Guided Diffusion Policy for Vision-Language Robotic Manipulation

arXiv cs.RO / 3/31/2026


Key Points

  • ProgressVLA is presented as a vision-language-action model for robotic manipulation that adds explicit task progress awareness, addressing a gap in prior VLA systems that rely on heuristics for termination.
  • The approach includes a robust progress estimator pre-trained on large unsupervised video-text robotic datasets, reporting low residual error in simulation and zero-shot generalization to unseen real-world samples.
  • It also introduces differentiable progress guidance using an inverse dynamics world model to predict future latent visual states from action tokens, which are then evaluated by the progress estimator.
  • Experiments on the CALVIN and LIBERO benchmarks, plus real-world robot deployment, report consistent improvements in success rates and generalization versus strong baselines.

Abstract

Most existing vision-language-action (VLA) models for robotic manipulation lack progress awareness, typically relying on hand-crafted heuristics for task termination. This limitation is particularly severe in long-horizon tasks involving cascaded sub-goals. In this work, we investigate the estimation and integration of task progress, proposing a novel model named ProgressVLA. Our technical contributions are twofold: (1) robust progress estimation: we pre-train a progress estimator on large-scale, unsupervised video-text robotic datasets. This estimator achieves a low prediction residual (0.07 on a scale of [0, 1]) in simulation and demonstrates zero-shot generalization to unseen real-world samples; and (2) differentiable progress guidance: we introduce an inverse dynamics world model that maps predicted action tokens into future latent visual states. These latents are then processed by the progress estimator; by applying a maximal progress regularization, we establish a differentiable pipeline that provides progress-piloted guidance to refine action tokens. Extensive experiments on the CALVIN and LIBERO benchmarks, alongside real-world robot deployment, consistently demonstrate substantial improvements in success rates and generalization over strong baselines.
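The differentiable guidance described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the linear "world model", the sigmoid "progress estimator", and all dimensions and names below are our own stand-ins, chosen only to show how a differentiable pipeline lets gradients of the progress score flow back into the action tokens.

```python
import numpy as np

# Hypothetical stand-ins (our assumptions, not the paper's components):
# a linear world model mapping action tokens to a future latent visual state,
# and a sigmoid progress estimator scoring that latent in [0, 1].
rng = np.random.default_rng(0)
D_ACT, D_LAT = 8, 16
W = rng.normal(scale=0.1, size=(D_LAT, D_ACT))   # world-model weights
w_prog = rng.normal(scale=0.5, size=D_LAT)       # progress-estimator weights

def progress(actions):
    """Predicted task progress in [0, 1] for a batch of action tokens."""
    z = actions @ W.T                             # predicted future latent state
    return 1.0 / (1.0 + np.exp(-(z @ w_prog)))    # sigmoid progress score

def refine(actions, lr=0.5, steps=20):
    """Gradient-ascent refinement of action tokens toward maximal progress.
    Because both toy models are differentiable, the gradient of
    sigmoid(w_prog . (W a)) with respect to a can be written analytically."""
    a = actions.copy()
    for _ in range(steps):
        p = progress(a)
        grad = (p * (1.0 - p))[:, None] * (w_prog @ W)[None, :]
        a += lr * grad                            # push actions toward higher progress
    return a

a0 = rng.normal(size=(4, D_ACT))                  # initial action tokens
a1 = refine(a0)
print(progress(a0).mean(), progress(a1).mean())   # mean progress should increase
```

In the paper's setting, the world model and progress estimator are learned networks rather than linear maps, and the refinement would be computed with autodiff under the maximal-progress regularization; this sketch only shows the direction of the gradient flow.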