EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation

arXiv cs.CV / April 22, 2026


Key Points

  • The paper introduces EgoMotion, a new approach to egocentric vision-language motion generation that synthesizes 3D human motion from first-person visual input plus natural-language instructions.
  • It identifies a key technical problem, *reasoning-generation entanglement*: jointly optimizing semantic reasoning and kinematic modeling causes gradient conflicts that degrade both multimodal grounding and motion quality.
  • EgoMotion addresses this by using a hierarchical two-stage generative framework inspired by separating cognitive reasoning from motor control.
  • In the first (cognitive reasoning) stage, a vision-language model converts the multimodal inputs into structured representations of discrete motion primitives, aligning semantics with actionable motion.
  • In the second (motion generation) stage, a diffusion-based generator iteratively denoises in a continuous latent space to produce physically plausible, temporally coherent trajectories, achieving state-of-the-art results.
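The two-stage decoupling described above can be sketched as a toy pipeline. Everything below is an illustrative assumption rather than the paper's actual model: the primitive vocabulary, the keyword-lookup "reasoner" standing in for the VLM, and the damped-update "denoiser" standing in for the diffusion generator are all hypothetical.

```python
import numpy as np

# Hypothetical vocabulary of discrete motion primitives (Stage-1 output space).
PRIMITIVES = ["walk_forward", "turn_left", "turn_right", "reach", "stop"]

def cognitive_reasoning(visual_feat: np.ndarray, instruction: str) -> list[int]:
    """Stage 1 stand-in: map multimodal input to discrete motion-primitive
    tokens. A real system would use a VLM; here we use a keyword lookup."""
    tokens = []
    for word, prim in [("forward", "walk_forward"), ("left", "turn_left"),
                       ("right", "turn_right"), ("grab", "reach")]:
        if word in instruction:
            tokens.append(PRIMITIVES.index(prim))
    tokens.append(PRIMITIVES.index("stop"))
    return tokens

def motion_generation(primitive_tokens: list[int], latent_dim: int = 8,
                      steps: int = 10, seed: int = 0) -> np.ndarray:
    """Stage 2 stand-in: iteratively refine a noisy latent conditioned on the
    primitive tokens (a toy analogue of diffusion sampling in latent space)."""
    rng = np.random.default_rng(seed)
    cond = np.zeros(latent_dim)
    cond[np.array(primitive_tokens) % latent_dim] = 1.0  # crude conditioning
    z = rng.standard_normal(latent_dim)                  # start from noise
    for _ in range(steps):                               # iterative refinement
        z = z + 0.3 * (cond - z)                         # drift toward cond
    return z

tokens = cognitive_reasoning(np.zeros(4), "walk forward then turn left")
motion_latent = motion_generation(tokens)
print(tokens)
print(motion_latent.shape)
```

The structural point is the decoupling motif: the reasoning stage commits to discrete, goal-consistent primitives first, and the generator only consumes that output, so semantic interpretation and kinematic synthesis are never optimized through one entangled objective.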

Abstract

Faithfully modeling human behavior in dynamic environments is a foundational challenge for embodied intelligence. While conditional motion synthesis has advanced significantly, egocentric motion generation remains largely underexplored due to the inherent complexity of first-person perception. In this work, we investigate Egocentric Vision-Language (Ego-VL) motion generation, the task of synthesizing 3D human motion conditioned jointly on first-person visual observations and natural-language instructions. We identify a critical *reasoning-generation entanglement* challenge: the simultaneous optimization of semantic reasoning and kinematic modeling introduces gradient conflicts that systematically degrade the fidelity of multimodal grounding and the quality of generated motion. To address this challenge, we propose **EgoMotion**, a hierarchical generative framework. Inspired by the biological decoupling of cognitive reasoning and motor control, EgoMotion operates in two stages. In the Cognitive Reasoning stage, a vision-language model (VLM) projects multimodal inputs into a structured space of discrete motion primitives. This forces the VLM to acquire goal-consistent representations, effectively bridging the semantic gap between high-level perceptual understanding and low-level action execution. In the Motion Generation stage, these learned representations serve as expressive conditioning signals for a diffusion-based motion generator. By performing iterative denoising within a continuous latent space, the generator synthesizes physically plausible and temporally coherent trajectories. Extensive evaluations demonstrate that EgoMotion achieves state-of-the-art performance and produces motion sequences that are both semantically grounded and kinematically superior to existing approaches.
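The "iterative denoising within a continuous latent space" can be made concrete with the standard DDPM reverse step. This is a generic formulation, not the paper's stated parameterization (the symbols, schedule, and conditioning mechanism here are assumptions), with the Stage-1 primitive representation written as the condition $c$:

```latex
% One reverse (denoising) step in the motion latent space, DDPM-style.
% z_t: noisy latent at step t;  c: condition from the Cognitive Reasoning
% stage (discrete-primitive representation);  \epsilon_\theta: learned
% noise predictor;  \alpha_t = 1 - \beta_t,  \bar\alpha_t = \prod_{s \le t} \alpha_s.
z_{t-1} \;=\; \frac{1}{\sqrt{\alpha_t}}
  \left( z_t \;-\; \frac{1 - \alpha_t}{\sqrt{1 - \bar\alpha_t}}\,
         \epsilon_\theta(z_t,\, t,\, c) \right) \;+\; \sigma_t\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)
```

Repeating this step from $t = T$ down to $t = 1$ turns Gaussian noise into a motion latent; conditioning $\epsilon_\theta$ on $c$ at every step is what keeps the synthesized trajectory consistent with the semantics fixed in the reasoning stage.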