Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion

arXiv cs.RO / 5/5/2026


Key Points

  • Action Agent is a two-stage framework that combines agentic navigation video generation with flow-constrained diffusion control to steer multi-embodiment robot navigation from language and image inputs.
  • In Stage I, an LLM orchestrates video diffusion model selection, iteratively validates and refines prompts, and builds cross-task memory to generate physically plausible first-person navigation videos, raising the generation success rate from 35% (single-shot) to 86% across 50 tasks (see the sketch after this list).
  • In Stage II, FlowDiT (a Flow-Constrained Diffusion Transformer) converts goal videos plus language instructions into continuous velocity commands via action-space denoising diffusion, using DINOv2 features, learned optical flow for ego-motion, and CLIP embeddings for semantic stopping.
  • The system is pretrained on the RECON dataset and fine-tuned on Unitree G1 humanoid data collected in Isaac Sim; a single 43M-parameter checkpoint achieves 73.2% navigation success in simulation and 64.7% task completion in unseen real indoor environments under open-loop execution, while running at 40–47 Hz.
  • Experiments across a humanoid robot, a drone, and a wheeled robot suggest that separating trajectory imagination from execution provides a scalable, embodiment-aware approach for language-guided navigation.
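
To make the Stage I loop concrete, here is a minimal sketch of how such an agentic generate-validate-refine cycle could be organized in Python. The class, the validator interface, and the prompt templates are illustrative assumptions, not the paper's actual orchestration code.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class AgenticVideoGenerator:
    # All of these callables are hypothetical stand-ins for the paper's components.
    llm: Callable[[str], str]                          # LLM call: prompt -> text
    video_models: dict[str, Callable[[str], object]]   # name -> video diffusion model
    validator: Callable[[object], tuple[bool, str]]    # video -> (plausible?, feedback)
    memory: list[str] = field(default_factory=list)    # cross-task prompt memory
    max_rounds: int = 5

    def generate(self, instruction: str, image_desc: str):
        # 1. Let the LLM pick which video diffusion model fits the task.
        choice = self.llm(
            f"Task: {instruction}. Available models: {list(self.video_models)}. Pick one."
        ).strip()
        model = self.video_models.get(choice, next(iter(self.video_models.values())))

        # 2. Draft a first-person navigation prompt, seeded with recent memory.
        prompt = self.llm(
            f"Write a first-person navigation video prompt for: {instruction}. "
            f"Scene: {image_desc}. Useful past prompts: {self.memory[-3:]}"
        )

        # 3. Generate, validate, and refine until the video passes or rounds run out.
        for _ in range(self.max_rounds):
            video = model(prompt)
            ok, feedback = self.validator(video)
            if ok:
                self.memory.append(prompt)  # accumulate cross-task memory
                return video
            prompt = self.llm(f"Revise this prompt: {prompt}\nValidator feedback: {feedback}")
        return None  # all refinement rounds failed
```

The key design point the paper highlights is the closed loop itself: single-shot prompting succeeded on only 35% of tasks, whereas validation, refinement, and memory reuse raised that to 86%.
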

Abstract

We present Action Agent, a two-stage framework that unifies agentic navigation video generation with flow-constrained diffusion control for multi-embodiment robot navigation. In Stage I, a large language model (LLM) acts as an orchestration module that selects video diffusion models, refines prompts through iterative validation, and accumulates cross-task memory to synthesize physically plausible first-person navigation videos from language and image inputs. This increases video generation success from 35% (single-shot) to 86% across 50 navigation tasks. In Stage II, we introduce FlowDiT, a Flow-Constrained Diffusion Transformer that converts optimized goal videos and language instructions into continuous velocity commands using action-space denoising diffusion. FlowDiT integrates DINOv2 visual features, learned optical flow for ego-motion representation, and CLIP language embeddings for semantic stopping. We pretrain on the RECON outdoor navigation dataset and fine-tune on 203 Unitree G1 humanoid episodes collected in Isaac Sim to calibrate velocity dynamics. A single 43M-parameter checkpoint achieves 73.2% navigation success in simulation and 64.7% task completion on a real Unitree G1 in unseen indoor environments under open-loop execution, while operating at 40–47 Hz. We evaluate Action Agent across three embodiments: a Unitree G1 humanoid (real hardware), a drone, and a wheeled mobile robot (Isaac Sim), demonstrating that decoupling trajectory imagination from execution yields a scalable and embodiment-aware paradigm for language-guided navigation.
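
For Stage II, the abstract describes FlowDiT as an action-space denoising diffusion model that turns goal videos and language into velocity commands. The sketch below shows the general shape of such a sampler in PyTorch: a conditioned noise-prediction network and a standard DDPM reverse loop over a short horizon of velocity commands. The architecture, the single fused conditioning vector (standing in for DINOv2, optical-flow, and CLIP features), the horizon, and the noise schedule are all assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class EpsilonNet(nn.Module):
    """Hypothetical noise-prediction transformer: maps a noised velocity
    sequence plus a fused conditioning vector to predicted noise."""
    def __init__(self, action_dim=3, cond_dim=768, hidden=256, steps=1000):
        super().__init__()
        self.cond_proj = nn.Linear(cond_dim, hidden)
        self.act_proj = nn.Linear(action_dim, hidden)
        self.time_embed = nn.Embedding(steps, hidden)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(hidden, action_dim)

    def forward(self, noisy_actions, t, cond):
        # noisy_actions: (B, H, action_dim); t: (B,); cond: (B, cond_dim)
        h = self.act_proj(noisy_actions)
        h = h + self.cond_proj(cond)[:, None, :] + self.time_embed(t)[:, None, :]
        return self.head(self.backbone(h))

@torch.no_grad()
def sample_velocity_commands(eps_model, cond, horizon=8, action_dim=3, steps=1000):
    """Standard DDPM reverse process over a short horizon of velocity commands."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(cond.shape[0], horizon, action_dim)  # start from pure noise
    for t in reversed(range(steps)):
        t_batch = torch.full((cond.shape[0],), t, dtype=torch.long)
        eps = eps_model(x, t_batch, cond)
        mean = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x  # (B, horizon, action_dim), e.g. [v_x, v_y, yaw_rate] per step
```

In a full pipeline, the sampled velocity sequence would presumably be streamed open loop to the robot's low-level controller, which is consistent with the 40–47 Hz command rate reported above; the semantic stopping behavior would additionally hinge on the CLIP-based conditioning rather than on the sampler itself.
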