Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion

arXiv cs.RO / 5/5/2026


Key Points

  • Action Agent is a two-stage framework that combines agentic navigation video generation with flow-constrained diffusion control to steer multi-embodiment robot navigation from language and image inputs.
  • In Stage I, an LLM orchestrates video diffusion model selection, iteratively validates and refines prompts, and builds cross-task memory to generate physically plausible first-person navigation videos, raising the generation success rate from 35% (single-shot) to 86% across 50 tasks (see the sketch after this list).
  • In Stage II, FlowDiT (a Flow-Constrained Diffusion Transformer) converts goal videos plus language instructions into continuous velocity commands via action-space denoising diffusion, using DINOv2 features, learned optical flow for ego-motion, and CLIP embeddings for semantic stopping.
  • The system is pretrained on the RECON dataset and fine-tuned on Unitree G1 humanoid data collected in Isaac Sim; a single 43M-parameter checkpoint achieves 73.2% navigation success in simulation and 64.7% task completion in unseen real indoor environments under open-loop execution, while running at 40–47 Hz.
  • Experiments across a humanoid robot, a drone, and a wheeled robot suggest that separating trajectory imagination from execution provides a scalable, embodiment-aware approach for language-guided navigation.
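
To make the Stage I loop concrete, here is a minimal sketch of how such an agentic generate-validate-refine cycle could be organized in Python. The class, the validator interface, and the prompt templates are illustrative assumptions, not the paper's actual orchestration code.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class AgenticVideoGenerator:
    # All of these callables are hypothetical stand-ins for the paper's components.
    llm: Callable[[str], str]                          # LLM call: prompt -> text
    video_models: dict[str, Callable[[str], object]]   # name -> video diffusion model
    validator: Callable[[object], tuple[bool, str]]    # video -> (plausible?, feedback)
    memory: list[str] = field(default_factory=list)    # cross-task prompt memory
    max_rounds: int = 5

    def generate(self, instruction: str, image_desc: str):
        # 1. Let the LLM pick which video diffusion model fits the task.
        choice = self.llm(
            f"Task: {instruction}. Available models: {list(self.video_models)}. Pick one."
        ).strip()
        model = self.video_models.get(choice, next(iter(self.video_models.values())))

        # 2. Draft a first-person navigation prompt, seeded with recent memory.
        prompt = self.llm(
            f"Write a first-person navigation video prompt for: {instruction}. "
            f"Scene: {image_desc}. Useful past prompts: {self.memory[-3:]}"
        )

        # 3. Generate, validate, and refine until the video passes or rounds run out.
        for _ in range(self.max_rounds):
            video = model(prompt)
            ok, feedback = self.validator(video)
            if ok:
                self.memory.append(prompt)  # accumulate cross-task memory
                return video
            prompt = self.llm(f"Revise this prompt: {prompt}\nValidator feedback: {feedback}")
        return None  # all refinement rounds failed
```

The key design point the paper highlights is the closed loop itself: single-shot prompting succeeded on only 35% of tasks, whereas validation, refinement, and memory reuse raised that to 86%.
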

Abstract

We present Action Agent, a two-stage framework that unifies agentic navigation video generation with flow-constrained diffusion control for multi-embodiment robot navigation. In Stage I, a large language model (LLM) acts as an orchestration module that selects video diffusion models, refines prompts through iterative validation, and accumulates cross-task memory to synthesize physically plausible first-person navigation videos from language and image inputs. This increases video generation success from 35% (single-shot) to 86% across 50 navigation tasks. In Stage II, we introduce FlowDiT, a Flow-Constrained Diffusion Transformer that converts optimized goal videos and language instructions into continuous velocity commands using action-space denoising diffusion. FlowDiT integrates DINOv2 visual features, learned optical flow for ego-motion representation, and CLIP language embeddings for semantic stopping. We pretrain on the RECON outdoor navigation dataset and fine-tune on 203 Unitree G1 humanoid episodes collected in Isaac Sim to calibrate velocity dynamics. A single 43M-parameter checkpoint achieves 73.2% navigation success in simulation and 64.7% task completion on a real Unitree G1 in unseen indoor environments under open-loop execution, while operating at 40–47 Hz. We evaluate Action Agent across three embodiments: a Unitree G1 humanoid (real hardware), a drone, and a wheeled mobile robot (Isaac Sim), demonstrating that decoupling trajectory imagination from execution yields a scalable and embodiment-aware paradigm for language-guided navigation.
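
For Stage II, the abstract describes FlowDiT as an action-space denoising diffusion model that turns goal videos and language into velocity commands. The sketch below shows the general shape of such a sampler in PyTorch: a conditioned noise-prediction network and a standard DDPM reverse loop over a short horizon of velocity commands. The architecture, the single fused conditioning vector (standing in for DINOv2, optical-flow, and CLIP features), the horizon, and the noise schedule are all assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class EpsilonNet(nn.Module):
    """Hypothetical noise-prediction transformer: maps a noised velocity
    sequence plus a fused conditioning vector to predicted noise."""
    def __init__(self, action_dim=3, cond_dim=768, hidden=256, steps=1000):
        super().__init__()
        self.cond_proj = nn.Linear(cond_dim, hidden)
        self.act_proj = nn.Linear(action_dim, hidden)
        self.time_embed = nn.Embedding(steps, hidden)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(hidden, action_dim)

    def forward(self, noisy_actions, t, cond):
        # noisy_actions: (B, H, action_dim); t: (B,); cond: (B, cond_dim)
        h = self.act_proj(noisy_actions)
        h = h + self.cond_proj(cond)[:, None, :] + self.time_embed(t)[:, None, :]
        return self.head(self.backbone(h))

@torch.no_grad()
def sample_velocity_commands(eps_model, cond, horizon=8, action_dim=3, steps=1000):
    """Standard DDPM reverse process over a short horizon of velocity commands."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(cond.shape[0], horizon, action_dim)  # start from pure noise
    for t in reversed(range(steps)):
        t_batch = torch.full((cond.shape[0],), t, dtype=torch.long)
        eps = eps_model(x, t_batch, cond)
        mean = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x  # (B, horizon, action_dim), e.g. [v_x, v_y, yaw_rate] per step
```

In a full pipeline, the sampled velocity sequence would presumably be streamed open loop to the robot's low-level controller, which is consistent with the 40–47 Hz command rate reported above; the semantic stopping behavior would additionally hinge on the CLIP-based conditioning rather than on the sampler itself.
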