VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis

arXiv cs.RO / 4/13/2026


Key Points

  • VAG (Dual-Stream Video-Action Generation) proposes a unified framework that generates video and actions jointly, targeting the biggest bottleneck for robot world models and world-action models: the shortage of action trajectories paired with video.
  • It synchronizes two flow-matching-based branches (video generation and action generation) and transfers a compact global context from the video side to the action side via adaptive 3D pooling, improving cross-modal consistency (see the sketches after this list).
  • The design aims to avoid the weak video-action alignment common in existing WA models, as well as the inefficiency and error accumulation of two-stage pipelines that first generate video and then infer actions.
  • In both simulated and real-world settings, the paper reports generation of aligned video-action pairs, executable trajectory replay, and improved downstream policy generalization from synthetic pretraining data.
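
To make the "adaptive 3D pooling" idea concrete, here is a minimal, hypothetical sketch of how video latents might be compressed into a compact global context and injected into the action branch. The class name `VideoContextAdapter`, the pool size, and the fusion by broadcast addition are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch (assumed names and shapes): adaptive 3D pooling of video
# latents into a fixed-size context vector that conditions the action branch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoContextAdapter(nn.Module):
    """Pools video latents (B, C, T, H, W) to a small fixed grid and projects
    the flattened result into the action branch's hidden size."""
    def __init__(self, video_channels: int, action_hidden: int, pool_size=(4, 2, 2)):
        super().__init__()
        t, h, w = pool_size
        self.pool_size = pool_size
        self.proj = nn.Linear(video_channels * t * h * w, action_hidden)

    def forward(self, video_latents: torch.Tensor) -> torch.Tensor:
        # Adaptive pooling keeps the output size fixed regardless of the clip's
        # frame count or spatial resolution.
        pooled = F.adaptive_avg_pool3d(video_latents, self.pool_size)  # (B, C, t, h, w)
        return self.proj(pooled.flatten(start_dim=1))                  # (B, action_hidden)

# Usage: fuse the compact video context into per-step action features.
adapter = VideoContextAdapter(video_channels=64, action_hidden=256)
video_latents = torch.randn(2, 64, 16, 32, 32)   # (batch, channels, frames, H, W)
action_tokens = torch.randn(2, 10, 256)          # (batch, action horizon, hidden)
context = adapter(video_latents)                 # (2, 256)
fused = action_tokens + context.unsqueeze(1)     # broadcast over the action horizon
```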

Abstract

Recent advances in robot foundation models trained on large-scale human teleoperation data have enabled robots to perform increasingly complex real-world tasks. However, scaling these systems remains difficult because collecting task-specific demonstrations is expensive and labor-intensive. Synthetic data, especially generated videos, offer a promising direction, but existing World Models (WMs) are not directly suitable for policy learning since they do not provide paired action trajectories. World-Action (WA) models partially address this by predicting actions with visual outputs, yet often lack strong video-action alignment, while two-stage pipelines that generate video first and then infer actions introduce inefficiency and error accumulation. To address these limitations, we propose VAG, a unified flow-matching-based dual-stream framework that jointly generates video and action under visual and language conditioning. By synchronizing denoising in both branches and using an adaptive 3D pooling mechanism to transfer compact global video context to the action branch, VAG improves cross-modal consistency during generation. Across both simulated and real-world settings, VAG produces aligned video-action pairs with competitive prediction quality, supports executable trajectory replay, and provides useful synthetic pretraining data that improves downstream policy generalization, indicating its potential as a practical world-action model for embodied data synthesis.
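
As a rough illustration of what "synchronizing denoising in both branches" could look like under a flow-matching objective, the sketch below shares a single time variable across the video and action streams and regresses both predicted velocities toward their rectified-flow targets. The function, the model interface, and all shapes are assumptions for illustration, not the authors' code.

```python
# Minimal dual-stream flow-matching training sketch (assumed interface):
# model(xt_video, xt_action, t, obs, lang) -> (pred_velocity_video, pred_velocity_action)
import torch

def dual_stream_fm_loss(model, video, action, obs, lang):
    """video: (B, C, T, H, W) clean video latents; action: (B, H, D) clean actions."""
    b = video.shape[0]
    t = torch.rand(b, device=video.device)     # one shared time per sample for both streams
    tv = t.view(b, 1, 1, 1, 1)
    ta = t.view(b, 1, 1)

    noise_v = torch.randn_like(video)
    noise_a = torch.randn_like(action)

    # Rectified-flow path x_t = (1 - t) * noise + t * data; velocity target = data - noise.
    xt_v = (1 - tv) * noise_v + tv * video
    xt_a = (1 - ta) * noise_a + ta * action

    pred_v, pred_a = model(xt_v, xt_a, t, obs, lang)

    target_v = video - noise_v
    target_a = action - noise_a
    return ((pred_v - target_v) ** 2).mean() + ((pred_a - target_a) ** 2).mean()

# Runnable check with a stand-in "model" that returns zero velocities.
dummy = lambda xv, xa, t, obs, lang: (torch.zeros_like(xv), torch.zeros_like(xa))
loss = dual_stream_fm_loss(dummy, torch.randn(2, 4, 8, 16, 16),
                           torch.randn(2, 10, 7), obs=None, lang=None)
```

Sharing one time variable per sample is one plausible reading of the paper's synchronized denoising; the key point is that both streams are denoised together rather than in a two-stage video-then-action pipeline.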