Unified Number-Free Text-to-Motion Generation Via Flow Matching

arXiv cs.CV / 3/31/2026


Key Points

  • The paper introduces Unified Motion Flow (UMF) to generate multi-person motion from text without requiring a fixed number of agents, addressing poor generalization in existing motion generators.
  • UMF separates the task into a single-pass motion prior generation stage (P-Flow) and multi-pass reaction generation stages, aiming to improve efficiency and reduce recursive error accumulation.
  • P-Flow uses hierarchical resolutions conditioned on different noise levels to lower computational overhead while learning strong priors across motion data.
  • S-Flow learns a joint probabilistic path for reaction transformation and context reconstruction, which the authors claim helps mitigate errors across iterative passes.
  • Experiments and user studies are reported to show UMF’s effectiveness as a text-to-motion “generalist” for multi-person motion generation, and the project page provides additional materials.
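The two-stage decomposition above can be illustrated with a toy sketch. Everything here is a hypothetical placeholder (function names, tensor shapes, and the trivial "networks"), not the paper's actual architecture: a P-Flow-like stage produces a shared motion prior in one pass, then an S-Flow-like stage runs once per additional agent, conditioning each reaction on the context generated so far.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 16, 8  # frames, latent motion dims (toy sizes, not from the paper)

def p_flow_prior(text_emb):
    # Hypothetical single-pass prior: map a text embedding to a motion latent.
    return np.tanh(text_emb[:D])[None, :].repeat(T, axis=0)

def s_flow_reaction(context, text_emb):
    # Hypothetical reaction stage: transform the running context into one
    # agent's motion while re-using (here: averaging) that context.
    return 0.5 * context.mean(axis=0, keepdims=True).repeat(T, axis=0) \
        + 0.1 * text_emb[:D]

def generate(text_emb, num_agents):
    motions = [p_flow_prior(text_emb)]            # single-pass prior stage
    for _ in range(num_agents - 1):               # multi-pass reaction stages
        context = np.stack(motions).mean(axis=0)  # aggregate agents so far
        motions.append(s_flow_reaction(context, text_emb))
    return np.stack(motions)                      # (num_agents, T, D)

out = generate(rng.normal(size=32), num_agents=3)
print(out.shape)  # (3, 16, 8)
```

The point of the sketch is the control flow: the agent count is a runtime argument rather than a fixed model dimension, which is what "number-free" refers to.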

Abstract

Generative models excel at motion synthesis for a fixed number of agents but struggle to generalize to a variable number of agents. Based on limited, domain-specific data, existing methods employ autoregressive models to generate motion recursively, which suffer from inefficiency and error accumulation. We propose Unified Motion Flow (UMF), which consists of Pyramid Motion Flow (P-Flow) and Semi-Noise Motion Flow (S-Flow). UMF decomposes number-free motion generation into a single-pass motion prior generation stage and multi-pass reaction generation stages. Specifically, UMF utilizes a unified latent space to bridge the distribution gap between heterogeneous motion datasets, enabling effective unified training. For motion prior generation, P-Flow operates on hierarchical resolutions conditioned on different noise levels, thereby mitigating computational overhead. For reaction generation, S-Flow learns a joint probabilistic path that adaptively performs reaction transformation and context reconstruction, alleviating error accumulation. Extensive results and user studies demonstrate UMF's effectiveness as a generalist model for multi-person motion generation from text. Project page: https://githubhgh.github.io/umf/.
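For background, both P-Flow and S-Flow build on flow matching, which trains a velocity field by regressing onto the straight-line interpolant between noise and data. The sketch below shows the standard conditional flow-matching loss with a linear path; it is generic background, not code from the paper, and the `model` is an untrained placeholder.

```python
import numpy as np

rng = np.random.default_rng(1)

def flow_matching_loss(model, x1, rng):
    """Standard conditional flow-matching loss with a linear path.

    x_t = (1 - t) * x0 + t * x1 moves noise x0 toward data x1, so the
    target velocity along the path is simply x1 - x0.
    """
    x0 = rng.normal(size=x1.shape)          # noise sample
    t = rng.uniform(size=(x1.shape[0], 1))  # per-sample time in [0, 1]
    xt = (1 - t) * x0 + t * x1              # point on the interpolant
    target_v = x1 - x0                      # constant velocity of the path
    pred_v = model(xt, t)                   # model predicts the velocity
    return np.mean((pred_v - target_v) ** 2)

data = rng.normal(size=(64, 8))             # toy "motion latents"
# Untrained placeholder model: always predicts zero velocity.
loss = flow_matching_loss(lambda xt, t: np.zeros_like(xt), data, rng)
```

At sampling time, the learned velocity field is integrated from t=0 (noise) to t=1 (data) with an ODE solver, which is what enables the single-pass and multi-pass generation stages described in the abstract.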