Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?

arXiv cs.RO / 4/7/2026


Key Points

  • The paper tests whether frontier video generation models (specifically Veo-3) can support generalizable robotic manipulation by predicting future image sequences from robot observations and using an inverse dynamics model (IDM) to recover robot actions.
  • The IDM is trained only on random-play data with no human supervision or expert demonstrations, aiming to map visually plausible trajectories into executable control signals.
  • In simulation and real-world experiments on a high-dimensional dexterous hand, the Veo-3+IDM approach produces approximately correct task-level trajectories but lacks sufficient low-level control accuracy for reliable completion of most tasks.
  • To address this limitation, the authors propose Veo-Act, a hierarchical framework that uses Veo-3 for high-level motion planning and a VLA policy for low-level execution, improving instruction-following performance of a state-of-the-art vision-language-action policy.
  • The results suggest that, as video generation models continue to improve, they will become increasingly useful components in generalizable robot learning pipelines, especially for planning and task-level guidance.
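The zero-shot pipeline described above can be sketched as a short control loop. This is a minimal illustration, not the paper's implementation: the function names, the action dimensionality (7), and the stubbed model calls are all hypothetical stand-ins for Veo-3 and the random-play-trained IDM.

```python
import numpy as np

def predict_future_frames(obs, instruction, horizon=8):
    """Stand-in for a frontier video model (the role Veo-3 plays):
    given the current observation and a language instruction, it would
    return predicted future frames. Here we simply repeat the input."""
    return [obs.copy() for _ in range(horizon)]

def inverse_dynamics(frame_t, frame_t_next):
    """Stand-in IDM: maps a pair of consecutive frames to a robot action.
    The paper's IDM is trained only on random-play data; this stub just
    returns a zero action of a hypothetical dimensionality."""
    action_dim = 7  # hypothetical; the paper uses a high-DoF dexterous hand
    return np.zeros(action_dim)

def veo_plus_idm_step(obs, instruction):
    """One zero-shot step: generate a visual plan in image space, then
    decode consecutive frame pairs into executable actions via the IDM."""
    frames = predict_future_frames(obs, instruction)
    return [inverse_dynamics(f0, f1) for f0, f1 in zip(frames, frames[1:])]
```

A horizon of 8 frames yields 7 frame pairs, hence 7 decoded actions per planning step; in practice the plan would be re-generated after executing some or all of them.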

Abstract

Video generation models have advanced rapidly and are beginning to show a strong understanding of physical dynamics. In this paper, we investigate how far an advanced video generation model such as Veo-3 can support generalizable robotic manipulation. We first study a zero-shot approach in which Veo-3 predicts future image sequences from current robot observations, while an inverse dynamics model (IDM) recovers the corresponding robot actions. The IDM is trained solely on random-play data, requiring neither human supervision nor expert demonstrations. The key intuition is that, if a video model can generate physically plausible future motions in image space, an IDM can translate those visual trajectories into executable robot actions. We evaluate this "Veo-3+IDM" approach in both simulation and the real world using a high-dimensional dexterous hand. We find that, owing to the strong generalization capability of frontier video models, Veo-3+IDM can consistently generate approximately correct task-level trajectories. However, its low-level control accuracy remains insufficient to solve most tasks reliably. Motivated by this observation, we develop a hierarchical framework, Veo-Act, which uses Veo-3 as a high-level motion planner and a VLA policy as the low-level executor, significantly improving the instruction-following performance of a state-of-the-art vision-language-action policy. Overall, our results suggest that, as video generation models continue to improve, they can become a valuable component for generalizable robot learning.
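The hierarchical split the abstract describes can likewise be sketched as a two-level loop: a video model proposes image-space subgoals, and a VLA policy executes low-level actions toward each one. Again, this is a hedged sketch under assumed interfaces: the function names, subgoal count, steps-per-subgoal budget, and stubbed models are illustrative, not the authors' code.

```python
import numpy as np

def plan_subgoals(obs, instruction, n_subgoals=4):
    """Stand-in high-level planner (the role Veo-3 plays in Veo-Act):
    would return a short sequence of subgoal images sampled from a
    generated video. Here it returns copies of the observation."""
    return [obs.copy() for _ in range(n_subgoals)]

def vla_policy(obs, subgoal_image):
    """Stand-in low-level VLA executor: maps the current observation and
    the next subgoal image to an action (zero vector in this sketch)."""
    return np.zeros(7)  # hypothetical action dimensionality

def veo_act_episode(obs, instruction, steps_per_subgoal=3):
    """Hierarchical loop: plan once at the image level, then let the VLA
    policy take a few low-level steps toward each subgoal in turn."""
    subgoals = plan_subgoals(obs, instruction)
    trajectory = []
    for goal in subgoals:
        for _ in range(steps_per_subgoal):
            trajectory.append(vla_policy(obs, goal))
    return trajectory
```

The design point this illustrates is the division of labor: the video model supplies task-level guidance (which the paper finds it does reliably), while the VLA policy supplies the low-level control accuracy the video model lacks.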