Anticipation-VLA: Solving Long-Horizon Embodied Tasks via Anticipation-based Subgoal Generation

arXiv cs.RO / 5/5/2026


Key Points

  • Vision-Language-Action (VLA) models can translate language and visual input into robot actions, but they often fail on long-horizon tasks because errors compound over time.
  • Prior approaches split tasks into subtasks of fixed granularity, which cannot adapt as the complexity of the execution state varies over the course of a task.
  • The paper introduces an Anticipation Model that adaptively and recursively generates future subgoals, updating them as task dynamics evolve to improve planning reliability.
  • It proposes Anticipation-VLA, a hierarchical framework that uses the anticipation-based subgoal generator to produce actionable goals for a low-level, goal-conditioned VLA policy.
  • Experiments in simulation and real-world robotic settings indicate that adaptive, recursive subgoal generation improves robustness and effectiveness for long-horizon embodied tasks.
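
The first key point, that errors compound over long horizons, can be made concrete with a small calculation. This is a minimal sketch under an assumption the paper does not state: every step succeeds independently with the same probability. The function name `chain_success` and the numbers used are illustrative only.

```python
def chain_success(p_step: float, horizon: int) -> float:
    """Probability that every one of `horizon` steps succeeds.

    Assumption (ours, not the paper's): steps succeed independently,
    each with the same probability `p_step`.
    """
    if not 0.0 <= p_step <= 1.0:
        raise ValueError("p_step must be a probability in [0, 1]")
    if horizon < 0:
        raise ValueError("horizon must be non-negative")
    return p_step ** horizon


if __name__ == "__main__":
    # Even a 99%-reliable step becomes unreliable over a long chain.
    for horizon in (10, 50, 100):
        print(horizon, round(chain_success(0.99, horizon), 3))
```

Under this toy model, a policy that is right 99% of the time per step completes a 100-step task only about 37% of the time, which is the kind of degradation that motivates breaking a task into nearer-term subgoals.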

Abstract

Vision-Language-Action (VLA) models have emerged as a powerful paradigm for embodied intelligence, enabling robots to perform tasks based on natural language instructions and current visual input. However, existing VLA models struggle with long-horizon tasks due to compounding errors. Prior methods decompose tasks into subtasks of fixed granularity, which cannot adapt to the varying complexity of execution states, limiting their robustness in long-horizon tasks. To overcome this, we introduce the Anticipation Model, which adaptively and recursively generates future subgoals. This model continuously adapts as the task unfolds, adjusting future subgoals in response to evolving dynamics and thereby producing more reliable plans. Building on this concept, we propose Anticipation-VLA, a hierarchical VLA model that leverages the anticipation model to generate actionable subgoals that guide VLA policy execution. We implement Anticipation-VLA by fine-tuning a Unified Multimodal Model (UMM) for high-level subgoal generation and a goal-conditioned VLA policy for low-level action execution. Experiments on both simulated and real-world robotic tasks demonstrate the effectiveness of Anticipation-VLA, highlighting the importance of adaptive and recursive subgoal generation for robust policy execution.
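
The hierarchical loop described in the abstract can be sketched with a toy example. This is not the paper's implementation: the UMM and the VLA policy are replaced by hypothetical stand-in functions on a one-dimensional integer world, and the names `anticipate`, `low_level_policy`, `run_episode`, and the `reach` parameter are assumptions made purely for illustration. The sketch only shows the control flow: a high-level generator recursively refines a subgoal until it is close enough to be actionable, a goal-conditioned low-level policy executes toward it, and the subgoal is regenerated from the new state.

```python
from typing import List, Tuple


def anticipate(state: int, goal: int, reach: int) -> int:
    """Hypothetical stand-in for the high-level subgoal generator.

    Recursively halves the gap between `state` and `goal` until the
    proposed subgoal lies within `reach` of the current state.
    """
    if abs(goal - state) <= reach:
        return goal
    midpoint = state + (goal - state) // 2
    return anticipate(state, midpoint, reach)


def low_level_policy(state: int, subgoal: int) -> int:
    """Toy goal-conditioned policy: one unit step toward the subgoal."""
    if subgoal > state:
        return 1
    if subgoal < state:
        return -1
    return 0


def run_episode(start: int, goal: int, reach: int = 3,
                max_steps: int = 100) -> Tuple[int, List[int]]:
    """Alternate between subgoal generation and low-level execution."""
    if reach < 1:
        raise ValueError("reach must be at least 1")
    state, steps = start, 0
    subgoals: List[int] = []
    while state != goal and steps < max_steps:
        # Re-generate the subgoal from the current state each round.
        subgoal = anticipate(state, goal, reach)
        subgoals.append(subgoal)
        # Give the low-level policy a short budget per subgoal.
        for _ in range(reach):
            if state == subgoal:
                break
            state += low_level_policy(state, subgoal)
            steps += 1
    return state, subgoals


if __name__ == "__main__":
    final_state, subgoals = run_episode(start=0, goal=10)
    print(final_state, subgoals)
```

In this toy setting the subgoal spacing is not fixed in advance: it depends on where the agent currently is relative to the goal, which loosely mirrors the contrast the abstract draws with fixed-granularity decomposition.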