Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?

arXiv cs.RO / 4/7/2026


Key Points

  • The paper tests whether frontier video generation models (specifically Veo-3) can support generalizable robotic manipulation by predicting future image sequences from robot observations and using an inverse dynamics model (IDM) to recover robot actions.
  • The IDM is trained only on random-play data with no human supervision or expert demonstrations, aiming to map visually plausible trajectories into executable control signals.
  • In simulation and real-world experiments on a high-dimensional dexterous hand, the Veo-3+IDM approach produces approximately correct task-level trajectories but lacks sufficient low-level control accuracy for reliable completion of most tasks.
  • To address this limitation, the authors propose Veo-Act, a hierarchical framework that uses Veo-3 for high-level motion planning and a VLA policy for low-level execution, improving instruction-following performance of a state-of-the-art vision-language-action policy.
  • The results suggest that, as video generation models continue to improve, they will become increasingly useful components in generalizable robot learning pipelines, especially for planning and task-level guidance.
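The zero-shot pipeline described above can be sketched as a short control loop. This is a minimal illustration, not the paper's implementation: the function names, the action dimensionality (7), and the stubbed model calls are all hypothetical stand-ins for Veo-3 and the random-play-trained IDM.

```python
import numpy as np

def predict_future_frames(obs, instruction, horizon=8):
    """Stand-in for a frontier video model (the role Veo-3 plays):
    given the current observation and a language instruction, it would
    return predicted future frames. Here we simply repeat the input."""
    return [obs.copy() for _ in range(horizon)]

def inverse_dynamics(frame_t, frame_t_next):
    """Stand-in IDM: maps a pair of consecutive frames to a robot action.
    The paper's IDM is trained only on random-play data; this stub just
    returns a zero action of a hypothetical dimensionality."""
    action_dim = 7  # hypothetical; the paper uses a high-DoF dexterous hand
    return np.zeros(action_dim)

def veo_plus_idm_step(obs, instruction):
    """One zero-shot step: generate a visual plan in image space, then
    decode consecutive frame pairs into executable actions via the IDM."""
    frames = predict_future_frames(obs, instruction)
    return [inverse_dynamics(f0, f1) for f0, f1 in zip(frames, frames[1:])]
```

A horizon of 8 frames yields 7 frame pairs, hence 7 decoded actions per planning step; in practice the plan would be re-generated after executing some or all of them.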

Abstract

Video generation models have advanced rapidly and are beginning to show a strong understanding of physical dynamics. In this paper, we investigate how far an advanced video generation model such as Veo-3 can support generalizable robotic manipulation. We first study a zero-shot approach in which Veo-3 predicts future image sequences from current robot observations, while an inverse dynamics model (IDM) recovers the corresponding robot actions. The IDM is trained solely on random-play data, requiring neither human supervision nor expert demonstrations. The key intuition is that, if a video model can generate physically plausible future motions in image space, an IDM can translate those visual trajectories into executable robot actions. We evaluate this "Veo-3+IDM" approach in both simulation and the real world using a high-dimensional dexterous hand. We find that, owing to the strong generalization capability of frontier video models, Veo-3+IDM can consistently generate approximately correct task-level trajectories. However, its low-level control accuracy remains insufficient to solve most tasks reliably. Motivated by this observation, we develop a hierarchical framework, Veo-Act, which uses Veo-3 as a high-level motion planner and a VLA policy as the low-level executor, significantly improving the instruction-following performance of a state-of-the-art vision-language-action policy. Overall, our results suggest that, as video generation models continue to improve, they can become a valuable component for generalizable robot learning.
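The hierarchical split the abstract describes can likewise be sketched as a two-level loop: a video model proposes image-space subgoals, and a VLA policy executes low-level actions toward each one. Again, this is a hedged sketch under assumed interfaces: the function names, subgoal count, steps-per-subgoal budget, and stubbed models are illustrative, not the authors' code.

```python
import numpy as np

def plan_subgoals(obs, instruction, n_subgoals=4):
    """Stand-in high-level planner (the role Veo-3 plays in Veo-Act):
    would return a short sequence of subgoal images sampled from a
    generated video. Here it returns copies of the observation."""
    return [obs.copy() for _ in range(n_subgoals)]

def vla_policy(obs, subgoal_image):
    """Stand-in low-level VLA executor: maps the current observation and
    the next subgoal image to an action (zero vector in this sketch)."""
    return np.zeros(7)  # hypothetical action dimensionality

def veo_act_episode(obs, instruction, steps_per_subgoal=3):
    """Hierarchical loop: plan once at the image level, then let the VLA
    policy take a few low-level steps toward each subgoal in turn."""
    subgoals = plan_subgoals(obs, instruction)
    trajectory = []
    for goal in subgoals:
        for _ in range(steps_per_subgoal):
            trajectory.append(vla_policy(obs, goal))
    return trajectory
```

The design point this illustrates is the division of labor: the video model supplies task-level guidance (which the paper finds it does reliably), while the VLA policy supplies the low-level control accuracy the video model lacks.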