Video2Act: A Dual-System Video Diffusion Policy with Robotic Spatio-Motional Modeling

arXiv cs.RO / 3/25/2026


Key Points

  • Video2Act proposes a framework that explicitly integrates into robotic action learning the spatially consistent representations and physically coherent motion that video diffusion models (VDMs) inherently encode across frames.
  • Concretely, it extracts foreground boundaries and inter-frame motion variations from the VDM, suppresses background noise and task-irrelevant biases, and feeds the refined representations as additional conditioning into a diffusion transformer (DiT) action head, enabling it to reason about what to manipulate and how to move.
  • To curb inference inefficiency, it introduces an asynchronous dual-system design in which the VDM serves as the slow "System 2" and the DiT action head as the fast "System 1", maintaining stable manipulation even under low-frequency VDM updates.
  • In evaluation, Video2Act outperforms prior VLA (Vision-Language-Action) methods in average success rate by 7.7% in simulation and 21.7% in real-world tasks, while also exhibiting strong generalization.

Abstract

Robust perception and dynamics modeling are fundamental to real-world robotic policy learning. Recent methods employ video diffusion models (VDMs) to enhance robotic policies, improving their understanding and modeling of the physical world. However, existing approaches overlook the coherent and physically consistent motion representations inherently encoded across frames in VDMs. To this end, we propose Video2Act, a framework that efficiently guides robotic action learning by explicitly integrating spatial and motion-aware representations. Building on the inherent representations of VDMs, we extract foreground boundaries and inter-frame motion variations while filtering out background noise and task-irrelevant biases. These refined representations are then used as additional conditioning inputs to a diffusion transformer (DiT) action head, enabling it to reason about what to manipulate and how to move. To mitigate inference inefficiency, we propose an asynchronous dual-system design, in which the VDM functions as the slow System 2 and the DiT head as the fast System 1, working collaboratively to generate adaptive actions. By providing motion-aware conditions to System 1, Video2Act maintains stable manipulation even with low-frequency updates from the VDM. In evaluation, Video2Act surpasses previous state-of-the-art VLA methods in average success rate by 7.7% in simulation and 21.7% in real-world tasks, and further exhibits strong generalization capabilities.
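The asynchronous dual-system scheduling described above can be illustrated with a minimal control-loop sketch. This is a hypothetical illustration, not the paper's implementation: the functions `slow_system2`, `fast_system1`, and the period parameter `vdm_period` are invented stand-ins for the VDM, the DiT action head, and the low-frequency update rate.

```python
# Hypothetical sketch of Video2Act's asynchronous dual-system loop.
# All names here are illustrative assumptions, not from the paper.

def slow_system2(observation):
    """Stand-in for the VDM (System 2): produces motion-aware
    conditioning (tagged here with the step it was computed at)."""
    return {"foreground": f"fg@{observation}", "motion": f"dm@{observation}"}

def fast_system1(observation, conditioning):
    """Stand-in for the DiT action head (System 1): generates an
    action from the current observation and the latest (possibly
    stale) VDM conditioning."""
    return (observation, conditioning["motion"])

def run_episode(num_steps=10, vdm_period=4):
    """System 1 runs every control step; System 2 refreshes its
    conditioning only every `vdm_period` steps, and System 1 reuses
    the stale conditioning in between."""
    actions, conditioning = [], None
    for t in range(num_steps):
        if t % vdm_period == 0:  # low-frequency System 2 update
            conditioning = slow_system2(t)
        actions.append(fast_system1(t, conditioning))  # every step
    return actions

acts = run_episode()
```

The point of the sketch is the rate decoupling: because System 1 is conditioned on motion-aware representations rather than raw video predictions, it can keep acting at high frequency while System 2 updates slowly.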