Next-Scale Autoregressive Models for Text-to-Motion Generation
arXiv cs.CV / 4/7/2026
Key Points
- The paper introduces MoScale, a next-scale autoregressive framework for text-to-motion generation that better matches motion’s temporal structure than standard next-token prediction.
- MoScale generates motion hierarchically from coarse to fine temporal resolutions, supplying global semantics early and progressively refining them to capture long-range structure.
- To handle limited paired text-motion data, the method adds cross-scale hierarchical refinement (improving per-scale initial predictions) and in-scale temporal refinement (selectively re-predicting bidirectionally within a scale).
- The authors report state-of-the-art text-to-motion results with high training efficiency, performance that scales with model size, and strong zero-shot generalization to diverse generation and editing tasks.
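The coarse-to-fine procedure described above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: `upsample`, `predict_fn`, and the scale schedule are all stand-ins for the model's learned components, and the "refinement" step simply adds a residual prediction on top of the upsampled coarse motion, mimicking cross-scale hierarchical refinement.

```python
import numpy as np

def upsample(motion, factor):
    """Nearest-neighbour upsampling along the time axis (toy stand-in for
    the model's learned cross-scale mapping)."""
    return np.repeat(motion, factor, axis=0)

def next_scale_generate(predict_fn, scales, dim, rng):
    """Coarse-to-fine generation in the spirit of next-scale autoregression:
    each scale conditions on the upsampled result of the previous one.
    `predict_fn`, `scales`, and `dim` are illustrative placeholders, not the
    paper's actual interfaces."""
    motion = None
    for length in scales:
        if motion is None:
            # Coarsest scale: produce an initial global sketch of the motion.
            motion = predict_fn(rng.standard_normal((length, dim)))
        else:
            factor = length // motion.shape[0]
            coarse = upsample(motion, factor)
            # Cross-scale hierarchical refinement (toy version): refine the
            # upsampled initial prediction rather than predicting from scratch.
            motion = coarse + predict_fn(coarse)
    return motion

# Toy "model": a fixed linear map standing in for the text-conditioned network.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4)) * 0.1
toy_predict = lambda x: x @ W

out = next_scale_generate(toy_predict, scales=[4, 8, 16], dim=4, rng=rng)
print(out.shape)  # (16, 4)
```

Each pass doubles the temporal resolution, so global semantics are fixed at the coarsest scale and later scales only add detail; the paper's in-scale temporal refinement (bidirectional re-prediction within a scale) is omitted here for brevity.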