Controllable Text-to-Motion Generation via Modular Body-Part Phase Control

arXiv cs.CV / 3/23/2026


Key Points

  • The paper proposes Modular Body-Part Phase Control, a plug-and-play framework for localized editing in text-to-motion generation.
  • It treats body-part dynamics as sinusoidal phase signals with amplitude, frequency, phase shift, and offset to produce compact, interpretable controls.
  • A modular Phase ControlNet branch injects these part signals through residual feature modulation, decoupling local editing from the backbone generator.
  • Experimental results on diffusion- and flow-based models demonstrate predictable, fine-grained control over motion magnitude, speed, and timing while preserving global motion coherence.
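The phase parameterization above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: the `PhaseCode` class and its fields are assumptions based on the four scalars the paper names (amplitude, frequency, phase shift, offset), with the signal taken as a standard sinusoid.

```python
import math
from dataclasses import dataclass

@dataclass
class PhaseCode:
    """Hypothetical per-body-part phase code (names are illustrative)."""
    amplitude: float    # controls motion magnitude
    frequency: float    # controls motion speed (cycles per second)
    phase_shift: float  # controls timing within the cycle
    offset: float       # baseline (DC) component of the signal

    def signal(self, t: float) -> float:
        """Evaluate the sinusoidal phase signal at time t (seconds)."""
        return (self.amplitude
                * math.sin(2 * math.pi * self.frequency * t + self.phase_shift)
                + self.offset)

# Example edit: slow down an arm swing by halving its frequency,
# leaving magnitude and timing untouched.
arm = PhaseCode(amplitude=1.0, frequency=2.0, phase_shift=0.0, offset=0.0)
slower_arm = PhaseCode(arm.amplitude, arm.frequency / 2, arm.phase_shift, arm.offset)
```

Because each control is a single interpretable scalar, an edit like the one above touches one number rather than a high-dimensional joint trajectory.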

Abstract

Text-to-motion (T2M) generation is becoming a practical tool for animation and interactive avatars. However, modifying specific body parts while maintaining overall motion coherence remains challenging. Existing methods typically rely on cumbersome, high-dimensional joint constraints (e.g., trajectories), which hinder user-friendly, iterative refinement. To address this, we propose Modular Body-Part Phase Control, a plug-and-play framework enabling structured, localized editing via a compact, scalar-based phase interface. By modeling body-part latent motion channels as sinusoidal phase signals characterized by amplitude, frequency, phase shift, and offset, we extract interpretable codes that capture part-specific dynamics. A modular Phase ControlNet branch then injects this signal via residual feature modulation, seamlessly decoupling control from the generative backbone. Experiments on both diffusion- and flow-based models demonstrate that our approach provides predictable and fine-grained control over motion magnitude, speed, and timing. It preserves global motion coherence and offers a practical paradigm for controllable T2M generation. Project page: https://jixiii.github.io/bp-phase-project-page/
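The "residual feature modulation" described in the abstract can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the shapes, the small control branch, and the zero-initialized output projection (a common ControlNet convention so the branch starts as an identity mapping) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def control_branch(phase_codes, W_in, W_out):
    """Map per-part phase codes (parts, 4) to a feature-space residual."""
    hidden = np.tanh(phase_codes.reshape(-1) @ W_in)  # embed the scalar codes
    return hidden @ W_out                             # project to backbone feature space

parts, feat_dim, hidden_dim = 5, 16, 8
phase_codes = rng.standard_normal((parts, 4))          # amplitude, frequency, shift, offset
W_in = rng.standard_normal((parts * 4, hidden_dim)) * 0.1
W_out = np.zeros((hidden_dim, feat_dim))               # zero-init: branch is inert at start

backbone_feat = rng.standard_normal(feat_dim)          # a feature from the frozen backbone
modulated = backbone_feat + control_branch(phase_codes, W_in, W_out)
# With W_out zero-initialized, the residual is zero and the backbone
# output passes through unchanged until the control branch is trained.
```

The additive residual is what decouples control from generation: the backbone is left frozen, and removing the branch recovers the original model exactly.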