A Unified Conditional Flow for Motion Generation, Editing, and Intra-Structural Retargeting

arXiv cs.AI / 4/16/2026


Key Points

  • The paper proposes a unified generative framework that treats text-driven motion editing and intra-structural retargeting as the same problem of conditional transport via flow matching.
  • It argues that editing and retargeting differ mainly in which conditioning signal (semantic from text vs. structural from target skeletons) is modulated during inference, enabling a single model to cover both tasks.
  • The authors implement a rectified-flow motion model that is jointly conditioned on text prompts and target skeletal structures, extending a DiT-style transformer with per-joint tokenization and joint self-attention to enforce kinematic dependencies.
  • A multi-condition classifier-free guidance strategy is used to balance text adherence with skeletal conformity, improving consistency versus fragmented task-specific pipelines.
  • Experiments on SnapMoGen and a multi-character Mixamo subset report that a single trained model handles text-to-motion generation, zero-shot editing, and zero-shot intra-structural retargeting.
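
The rectified-flow sampling described in the points above can be sketched as a simple ODE integration. The following is a minimal illustration only; the function names, the Euler solver, and the toy velocity field are assumptions, not the paper's actual implementation:

```python
import numpy as np

def sample_rectified_flow(velocity_fn, x0, cond, steps=50):
    """Euler-integrate dx/dt = v(x, t, cond) from t=0 (noise) to t=1 (motion).

    `velocity_fn` stands in for the conditional DiT-style network; `cond`
    would bundle the text embedding and the target-skeleton encoding
    (both hypothetical placeholders here).
    """
    x = x0
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t, cond)
    return x

# Toy check: with a straight-line (rectified) velocity field pointing at a
# target pose, integrating from the origin recovers that target.
target = np.array([1.0, 2.0])
motion = sample_rectified_flow(lambda x, t, cond: cond, np.zeros(2), target)
```

Under the paper's framing, which component of `cond` is modulated at inference is what distinguishes the tasks: varying the text condition yields editing, while varying the skeleton condition yields intra-structural retargeting.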

Abstract

Text-driven motion editing and intra-structural retargeting, where source and target share topology but may differ in bone lengths, are traditionally handled by fragmented pipelines with incompatible inputs and representations: editing relies on specialized generative steering, while retargeting is deferred to geometric post-processing. We present a unifying perspective where both tasks are cast as instances of conditional transport within a single generative framework. By leveraging recent advances in flow matching, we demonstrate that editing and retargeting are fundamentally the same generative task, distinguished only by which conditioning signal, semantic or structural, is modulated during inference. We implement this vision via a rectified-flow motion model jointly conditioned on text prompts and target skeletal structures. Our architecture extends a DiT-style transformer with per-joint tokenization and explicit joint self-attention to strictly enforce kinematic dependencies, while a multi-condition classifier-free guidance strategy balances text adherence with skeletal conformity. Experiments on SnapMoGen and a multi-character Mixamo subset show that a single trained model supports text-to-motion generation, zero-shot editing, and zero-shot intra-structural retargeting. This unified approach simplifies deployment and improves structural consistency compared to task-specific baselines.
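
The multi-condition classifier-free guidance mentioned in the abstract could combine separate guidance weights for the text and skeleton conditions. A minimal sketch, assuming per-condition velocity predictions and illustrative weights (the signature and default values are not from the paper):

```python
def multi_condition_cfg(v_uncond, v_text, v_skel, w_text=4.0, w_skel=2.0):
    # Extrapolate from the unconditional velocity prediction toward each
    # conditional prediction, with independent weights balancing text
    # adherence (w_text) against skeletal conformity (w_skel).
    return v_uncond + w_text * (v_text - v_uncond) + w_skel * (v_skel - v_uncond)
```

In this scheme, raising `w_skel` relative to `w_text` at inference time would trade prompt adherence for structural consistency, which is the balance the paper's guidance strategy aims to control.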