Action Images: End-to-End Policy Learning via Multiview Video Generation

arXiv cs.RO / 4/8/2026

Key Points

  • The paper introduces “Action Images,” a unified world action model that frames robot policy learning as multiview video generation rather than relying on separate action modules or action tokens that are not grounded in pixels.
  • It represents 7-DoF robot actions as interpretable, multi-view “action videos” grounded in 2D pixels that explicitly track robot-arm motion, enabling the underlying video backbone to serve as a zero-shot policy (a rough sketch of this mapping follows the list).
  • The approach removes the need for a dedicated policy head/action module and is designed to improve transfer across viewpoints and environments by leveraging pretrained video models more directly.
  • Beyond policy learning, the shared representation also supports video-action joint generation, action-conditioned video generation, and action labeling, suggesting a versatile multimodal framework.
  • Experiments on RLBench and real-world settings report the strongest zero-shot success rates and improved joint generation quality versus prior video-space world models, highlighting the benefit of pixel-grounded action representations.
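
As a rough illustration of the forward mapping described above, here is a minimal Python sketch of rendering a 7-DoF action trajectory as per-view pixel tracks. It assumes a standard pinhole camera model and an action layout of [x, y, z, roll, pitch, yaw, gripper]; the function names, the growing-track rendering, and the gripper color coding are illustrative assumptions, not the paper's actual encoding.

```python
# Hypothetical sketch: render a 7-DoF action trajectory as multi-view
# "action videos" by projecting the end-effector path into each camera.
# The rendering scheme below (a growing colored pixel track) is an
# assumption; the paper's exact action-image encoding may differ.
import numpy as np

def project_points(points_3d, K, T_world_to_cam):
    """Pinhole projection of (N, 3) world points to (N, 2) pixel coords."""
    pts_h = np.hstack([points_3d, np.ones((len(points_3d), 1))])  # (N, 4)
    cam = (T_world_to_cam @ pts_h.T).T[:, :3]                     # camera frame
    uv = (K @ cam.T).T                                            # homogeneous pixels
    return uv[:, :2] / uv[:, 2:3]                                 # perspective divide

def render_action_videos(actions, cameras, hw=(256, 256)):
    """actions: (T, 7) array [x, y, z, roll, pitch, yaw, gripper].
    cameras: list of (K, T_world_to_cam) pairs, one per view.
    Returns one (T, H, W, 3) uint8 action video per camera view."""
    videos = []
    for K, T_wc in cameras:
        uv = project_points(actions[:, :3], K, T_wc)  # end-effector pixel track
        frames = np.zeros((len(actions), *hw, 3), dtype=np.uint8)
        for t in range(len(actions)):
            # Draw the track up to time t, so motion is explicit in pixels.
            for s, (u, v) in enumerate(uv[: t + 1].astype(int)):
                if 0 <= v < hw[0] and 0 <= u < hw[1]:
                    # Gripper state as color (assumption): green=open, red=closed.
                    frames[t, v, u] = (0, 255, 0) if actions[s, 6] > 0.5 else (255, 0, 0)
        videos.append(frames)
    return videos
```

Orientation could be handled the same way, for instance by projecting short gripper-frame axis segments into each view, so that the full 7-DoF action stays recoverable from pixels.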

Abstract

World action models (WAMs) have emerged as a promising direction for robot policy learning, as they can leverage powerful video backbones to model future states. However, existing approaches often rely on separate action modules or use action representations that are not pixel-grounded, making it difficult to fully exploit the pretrained knowledge of video models and limiting transfer across viewpoints and environments. In this work, we present Action Images, a unified world action model that formulates policy learning as multiview video generation. Instead of encoding control as low-dimensional tokens, we translate 7-DoF robot actions into interpretable action images: multi-view action videos that are grounded in 2D pixels and explicitly track robot-arm motion. This pixel-grounded action representation allows the video backbone itself to act as a zero-shot policy, without a separate policy head or action module. Beyond control, the same unified model supports video-action joint generation, action-conditioned video generation, and action labeling under a shared representation. On RLBench and real-world evaluations, our model achieves the strongest zero-shot success rates and improves video-action joint generation quality over prior video-space world models, suggesting that interpretable action images are a promising route to policy learning.
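
For the reverse direction, action labeling, a pixel-grounded representation turns action recovery into a classical multi-view geometry problem. Below is a minimal sketch assuming the end-effector marker has already been detected in two or more calibrated views; linear triangulation (DLT) is a standard technique used here for illustration, not necessarily the paper's decoding procedure.

```python
# Hypothetical inverse step: recover a 3D end-effector position from its
# detected pixel locations in multiple calibrated views via linear
# triangulation (DLT). Marker detection itself is assumed to be done.
import numpy as np

def triangulate(uvs, projections):
    """uvs: list of (u, v) pixel detections, one per view.
    projections: list of 3x4 camera matrices P = K @ [R | t].
    Returns the least-squares 3D point in world coordinates."""
    A = []
    for (u, v), P in zip(uvs, projections):
        # Each view contributes two linear constraints on the homogeneous point.
        A.append(u * P[2] - P[0])
        A.append(v * P[2] - P[1])
    # Homogeneous solution: right singular vector of A with smallest singular value.
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]
    return X[:3] / X[3]
```

In this framing, the backbone-as-policy loop would be: generate the multi-view action video, detect the marker tracks in each view, and triangulate them back into executable 3D motion.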