Exploring Motion-Language Alignment for Text-driven Motion Generation
arXiv cs.CV / 4/6/2026
Key Points
- The paper addresses the challenge of aligning motion dynamics with textual semantics in text-driven human motion generation and reframes it as a motion-language alignment problem.
- It proposes MLA-Gen, which combines global motion priors with fine-grained local conditioning to better capture common motion patterns while improving detailed text-motion alignment.
- The authors identify an “attention sink” issue: attention concentrates excessively on the first text token, so informative cues in the rest of the prompt are underused and semantic grounding weakens.
- They introduce SinkRatio to measure this attention concentration and develop alignment-aware masking and control strategies to regulate attention during generation.
- Experiments on multiple baselines show consistent improvements in both motion quality and motion-language alignment, with code planned for release after acceptance.
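
The attention-sink idea in the points above can be illustrated with a small sketch. The summary does not give the paper's exact definition of SinkRatio or of its masking strategy, so everything here is an assumption: `sink_ratio` is taken to be the mean share of each motion query's attention mass that lands on one text token, and `mask_sink` is a hypothetical stand-in for masking that simply zeroes that token's column and renormalizes. Neither function is the authors' implementation.

```python
from typing import List

# Rows are motion queries, columns are text tokens (assumed layout).
Matrix = List[List[float]]


def sink_ratio(attn: Matrix, sink_index: int = 0) -> float:
    """Mean share of each query's attention mass placed on one text token.

    Assumed definition for illustration; the paper's SinkRatio may differ.
    """
    if not attn:
        raise ValueError("attention matrix is empty")
    shares = []
    for row in attn:
        total = sum(row)
        if total <= 0:
            raise ValueError("each attention row must have positive mass")
        shares.append(row[sink_index] / total)
    return sum(shares) / len(shares)


def mask_sink(attn: Matrix, sink_index: int = 0) -> Matrix:
    """Zero the sink column and renormalize each row to sum to 1.

    Hypothetical masking rule, not the paper's alignment-aware strategy.
    """
    masked = []
    for row in attn:
        if len(row) < 2:
            raise ValueError("need at least two text tokens to mask one")
        kept = [0.0 if j == sink_index else w for j, w in enumerate(row)]
        total = sum(kept)
        if total <= 0:
            # All mass sat on the sink: spread it uniformly over the rest.
            share = 1.0 / (len(row) - 1)
            kept = [0.0 if j == sink_index else share for j in range(len(row))]
            total = 1.0
        masked.append([w / total for w in kept])
    return masked


# Toy attention map: two motion queries over three text tokens.
attn = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
]
ratio_before = sink_ratio(attn)
masked_attn = mask_sink(attn)
ratio_after = sink_ratio(masked_attn)
print(f"sink ratio before masking: {ratio_before:.2f}")
print(f"sink ratio after masking:  {ratio_after:.2f}")
```

In this toy map most of the mass sits on the first token, so the ratio is high before masking and drops to zero once that column is removed. A real model would more plausibly down-weight the sink than remove it outright.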




