Exploring Motion-Language Alignment for Text-driven Motion Generation

arXiv cs.CV / 4/6/2026


Key Points

  • The paper addresses the challenge of aligning motion dynamics with textual semantics in text-driven human motion generation and reframes it as a motion-language alignment problem.
  • It proposes MLA-Gen, which combines global motion priors with fine-grained local conditioning to better capture common motion patterns while improving detailed text-motion alignment.
  • The authors identify an “attention sink” issue where attention overly concentrates on the first text token, weakening the use of informative cues and reducing semantic grounding.
  • They introduce SinkRatio to measure this attention concentration and develop alignment-aware masking and control strategies to regulate attention during generation.
  • Experiments on multiple baselines show consistent improvements in both motion quality and motion-language alignment, with code planned for release after acceptance.
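The paper does not publish its implementation, but the idea behind SinkRatio and the masking remedy can be sketched in a few lines. The snippet below is an illustrative reconstruction, not the authors' code: `sink_ratio` measures the fraction of attention mass landing on one key position (here the first text token), and `mask_sink` shows one simple way to suppress that position before re-normalizing. The function names and the exact normalization are assumptions for illustration.

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """Row-wise softmax over attention logits of shape (queries, keys)."""
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sink_ratio(attn: np.ndarray, sink_index: int = 0) -> float:
    """Fraction of total attention mass assigned to a single (sink) key.

    attn: attention weights of shape (queries, keys), rows summing to 1.
    A value near 1 means attention has collapsed onto the sink token.
    """
    return float(attn[:, sink_index].sum() / attn.sum())

def mask_sink(scores: np.ndarray, sink_index: int = 0) -> np.ndarray:
    """One naive remedy: mask the sink key's logits, then re-normalize.

    This is only a toy stand-in for the paper's alignment-aware
    masking and control strategies.
    """
    masked = scores.copy()
    masked[:, sink_index] = -np.inf  # exp(-inf) -> 0 attention weight
    return softmax(masked)

# Toy example: logits that strongly favor key 0 (the "sink").
scores = np.array([[5.0, 1.0, 1.0],
                   [5.0, 1.0, 1.0]])
attn = softmax(scores)
print(sink_ratio(attn))            # close to 1: attention has collapsed
print(mask_sink(scores)[:, 0])     # zeros: sink token receives no mass
```

In practice such a ratio would be computed per attention head over the cross-attention weights between motion frames and text tokens; the 2x3 matrix here is just a minimal demonstration of the collapse-and-mask mechanics.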

Abstract

Text-driven human motion generation aims to synthesize realistic motion sequences that follow textual descriptions. Despite recent advances, accurately aligning motion dynamics with textual semantics remains a fundamental challenge. In this paper, we revisit text-to-motion generation from the perspective of motion-language alignment and propose MLA-Gen, a framework that integrates global motion priors with fine-grained local conditioning. This design enables the model to capture common motion patterns, while establishing detailed alignment between texts and motions. Furthermore, we identify a previously overlooked attention sink phenomenon in human motion generation, where attention disproportionately concentrates on the start text token, limiting the utilization of informative textual cues and leading to degraded semantic grounding. To analyze this issue, we introduce SinkRatio, a metric for measuring attention concentration, and develop alignment-aware masking and control strategies to regulate attention during generation. Extensive experiments demonstrate that our approach consistently improves both motion quality and motion-language alignment over strong baselines. Code will be released upon acceptance.