Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?
arXiv cs.RO / 4/7/2026
Key Points
- The paper tests whether frontier video generation models (specifically Veo-3) can support generalizable robotic manipulation by predicting future image sequences from robot observations and using an inverse dynamics model (IDM) to recover robot actions.
- The IDM is trained only on random-play data with no human supervision or expert demonstrations, aiming to map visually plausible trajectories into executable control signals.
- In simulation and real-world experiments on a high-dimensional dexterous hand, the Veo-3+IDM approach produces approximately correct task-level trajectories but lacks sufficient low-level control accuracy for reliable completion of most tasks.
- To address this limitation, the authors propose Veo-Act, a hierarchical framework in which Veo-3 performs high-level motion planning and a vision-language-action (VLA) policy handles low-level execution, improving the instruction-following performance of a state-of-the-art VLA policy.
- The results indicate that improving video generation models may make them increasingly useful as a component in generalizable robot learning pipelines, especially for planning and task-level guidance.
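The pipeline described above can be sketched in a few lines. This is a minimal illustration with hypothetical stand-ins, not the authors' code: `video_planner` substitutes a toy interpolation for Veo-3's frame prediction, and `inverse_dynamics` substitutes a frame-delta heuristic for the learned IDM (which the paper trains only on random-play data). The function names, `Frame`/`Action` representations, and the `horizon` parameter are all assumptions for the sketch.

```python
from typing import List

Frame = List[float]   # stand-in for an image observation
Action = List[float]  # stand-in for a robot control command


def video_planner(obs: Frame, instruction: str, horizon: int) -> List[Frame]:
    """Stand-in for the video model: predict a sequence of future frames.

    Here a toy linear interpolation toward an instruction-implied goal
    replaces Veo-3's generative rollout.
    """
    goal = [1.0] * len(obs) if "reach" in instruction else [0.0] * len(obs)
    return [
        [o + (g - o) * (t + 1) / horizon for o, g in zip(obs, goal)]
        for t in range(horizon)
    ]


def inverse_dynamics(prev: Frame, nxt: Frame) -> Action:
    """Stand-in IDM: recover the action that connects consecutive frames.

    A frame-to-frame delta replaces the model trained on random-play data.
    """
    return [n - p for p, n in zip(prev, nxt)]


def rollout(obs: Frame, instruction: str, horizon: int = 4) -> List[Action]:
    """Plan frames with the video model, then recover actions with the IDM."""
    frames = [obs] + video_planner(obs, instruction, horizon)
    return [inverse_dynamics(a, b) for a, b in zip(frames, frames[1:])]
```

In the hierarchical Veo-Act variant, the recovered trajectory would instead condition a VLA policy, which supplies the low-level control accuracy that the IDM-only pipeline lacks.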