Next-Scale Autoregressive Models for Text-to-Motion Generation

arXiv cs.CV / 4/7/2026


Key Points

  • The paper introduces MoScale, a next-scale autoregressive framework for text-to-motion generation that better matches motion’s temporal structure than standard next-token prediction.
  • MoScale generates motion hierarchically from coarse to fine temporal resolutions, supplying global semantics early and progressively refining them to capture long-range structure.
  • To handle limited paired text-motion data, the method adds cross-scale hierarchical refinement (improving per-scale initial predictions) and in-scale temporal refinement (selectively re-predicting bidirectionally within a scale).
  • The authors report state-of-the-art text-to-motion results with high training efficiency, scaling with model size, and strong zero-shot generalization to diverse generation and editing tasks.

Abstract

Autoregressive (AR) models offer stable and efficient training, but standard next-token prediction is not well aligned with the temporal structure required for text-conditioned motion generation. We introduce MoScale, a next-scale AR framework that generates motion hierarchically from coarse to fine temporal resolutions. By providing global semantics at the coarsest scale and refining them progressively, MoScale establishes a causal hierarchy better suited for long-range motion structure. To improve robustness under limited text-motion data, we further incorporate cross-scale hierarchical refinement for improving per-scale initial predictions and in-scale temporal refinement for selective bidirectional re-prediction. MoScale achieves state-of-the-art text-to-motion performance with high training efficiency, scales effectively with model size, and generalizes zero-shot to diverse motion generation and editing tasks.
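The paper does not include implementation details here, but the coarse-to-fine decoding loop it describes can be illustrated with a toy sketch. Everything below is hypothetical: `upsample` and `refine` are simple stand-ins (nearest-neighbour upsampling and neighbour averaging), not the learned per-scale predictor or the cross-scale/in-scale refinement modules from the paper.

```python
# Hypothetical sketch of next-scale autoregressive decoding on a 1-D
# "motion" signal. Each scale conditions on the upsampled output of the
# previous (coarser) scale and refines it, mirroring the coarse-to-fine
# causal hierarchy described in the abstract.

def upsample(seq, factor):
    """Nearest-neighbour upsampling of a 1-D sequence."""
    return [v for v in seq for _ in range(factor)]

def refine(seq):
    """Toy stand-in for the per-scale predictor: smooth each frame with
    its neighbours (the real model would re-predict tokens)."""
    out = []
    for i, v in enumerate(seq):
        left = seq[i - 1] if i > 0 else v
        right = seq[i + 1] if i < len(seq) - 1 else v
        out.append((left + v + right) / 3.0)
    return out

def next_scale_generate(coarse, num_scales, factor=2):
    """Coarse-to-fine generation: start from a coarse sequence (carrying
    the global semantics) and refine it across progressively finer scales."""
    seq = list(coarse)
    for _ in range(num_scales):
        seq = refine(upsample(seq, factor))
    return seq

# Two coarse frames, two doubling scales -> eight refined frames.
motion = next_scale_generate([0.0, 1.0], num_scales=2)
print(len(motion))  # → 8
```

In this sketch the coarsest sequence already fixes the global trajectory (here, a rise from 0 to 1), and each finer scale only adds temporal detail, which is the intuition behind the hierarchy's suitability for long-range motion structure.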