Action Images: End-to-End Policy Learning via Multiview Video Generation
arXiv cs.RO / 4/8/2026
Key Points
- The paper introduces “Action Images,” a unified world action model that frames robot policy learning as multiview video generation rather than relying on separate action modules or non–pixel-grounded action tokens.
- It represents 7-DoF robot actions as interpretable, multiview “action videos” grounded in 2D pixels that explicitly track robot-arm motion, enabling the underlying video backbone to serve as a zero-shot policy (a minimal sketch of this projection follows the list).
- The approach removes the need for a dedicated policy head/action module and is designed to improve transfer across viewpoints and environments by leveraging pretrained video models more directly.
- Beyond policy learning, the shared representation also supports video-action joint generation, action-conditioned video generation, and action labeling, suggesting a versatile multimodal framework.
- Experiments on RLBench and in real-world settings report the highest zero-shot success rates and improved joint-generation quality versus prior video-space world models, highlighting the benefit of pixel-grounded action representations.
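
To make the pixel-grounding idea concrete, here is a minimal Python sketch of one plausible reading: project the end-effector trajectory through each camera's pinhole model and rasterize it, one frame per timestep, into a per-view “action video”. Everything here is assumed for illustration; the function names, camera parameters, single-pixel marker, and gripper-as-intensity encoding are not from the paper, whose rendering of full 7-DoF actions (including orientation) is necessarily richer.

```python
import numpy as np

def project_points(points_w, K, T_cw):
    """Project (N, 3) world-frame points to (N, 2) pixel coordinates.

    K is a 3x3 pinhole intrinsics matrix; T_cw is a 4x4 world-to-camera
    transform. Points are assumed to lie in front of the camera (z > 0).
    """
    pts_h = np.hstack([points_w, np.ones((len(points_w), 1))])  # homogeneous coords
    pts_c = (T_cw @ pts_h.T).T[:, :3]                           # camera frame
    uvw = (K @ pts_c.T).T
    return uvw[:, :2] / uvw[:, 2:3]                             # perspective divide

def render_action_video(traj_w, grip, K, T_cw, hw=(256, 256)):
    """Rasterize a (T, 3) end-effector trajectory into a (T, H, W) action video.

    Frame t marks the end-effector's projected pixel at timestep t; pixel
    intensity encodes gripper openness (an illustrative choice, not the
    paper's actual 7-DoF rendering, which also captures orientation).
    """
    uv = np.round(project_points(traj_w, K, T_cw)).astype(int)
    video = np.zeros((len(traj_w), *hw), dtype=np.float32)
    for t, ((u, v), g) in enumerate(zip(uv, grip)):
        if 0 <= v < hw[0] and 0 <= u < hw[1]:
            video[t, v, u] = 0.5 + 0.5 * g
    return video

# Toy usage: a straight-line reach with the gripper closing over the last
# few steps, rendered from two hypothetical camera viewpoints.
T = 16
traj = np.linspace([0.3, -0.1, 0.2], [0.5, 0.1, 0.05], T)   # (T, 3) positions
grip = np.concatenate([np.ones(T - 4), np.zeros(4)])        # 1 = open, 0 = closed
K = np.array([[200.0, 0.0, 128.0],
              [0.0, 200.0, 128.0],
              [0.0, 0.0, 1.0]])
front = np.eye(4); front[2, 3] = 1.0         # camera 1 m back along +z
side = front.copy()                          # placeholder pose; a real rig would differ
action_video = np.stack([render_action_video(traj, grip, K, T_cw)
                         for T_cw in (front, side)])        # (views, T, H, W)
```

In this reading, the representation cuts both ways: a video model that generates such frames alongside RGB observations is acting as a policy, and locating the marker in each predicted frame recovers the action, which is what lets the backbone double as a zero-shot policy without a separate action head.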