Learning Reactive Human Motion Generation from Paired Interaction Data Using Transformer-Based Models

arXiv cs.CV / 4/27/2026


Key Points

  • The paper studies interactive human motion generation: predicting one person’s motion conditioned on another person’s, in settings where the two motions are mutually dependent, rather than single-agent motion.
  • It builds a new dataset of paired action–reaction motion sequences extracted from boxing match videos and evaluates Transformer-based approaches on the task.
  • Three Transformer variants are compared (a simple Transformer, iTransformer, and Crossformer); the simple Transformer is found to produce plausible interaction-aware motions without posture collapse.
  • iTransformer and Crossformer are reported to accumulate errors over time, resulting in unstable motion generation.
  • The authors propose adding a person ID embedding to explicitly distinguish individuals, which helps the model maintain structural consistency and prevents collapse during generation.
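
The paper does not publish code, but the person ID embedding idea can be sketched roughly as follows: each agent's per-frame pose features receive, in addition to the usual positional encoding, a learned vector identifying which person the frame belongs to, before being fed to the Transformer. All names and dimensions below are illustrative assumptions, not the authors' implementation; random vectors stand in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D = 8, 16          # frames per sequence, pose feature dimension (illustrative)
num_persons = 2       # the two interacting agents (e.g., action and reaction)

# Stand-ins for learned embedding tables; in a real model these are trainable.
person_embed = rng.normal(size=(num_persons, D))  # one vector per person ID
pos_embed = rng.normal(size=(T, D))               # standard positional encoding

def embed_sequence(pose_feats, person_id):
    """Add positional and person-ID embeddings to per-frame pose features."""
    return pose_feats + pos_embed + person_embed[person_id]

# Concatenate both agents' embedded frames into one token sequence, so the
# Transformer can attend across persons while still telling them apart.
action = rng.normal(size=(T, D))
reaction = rng.normal(size=(T, D))
tokens = np.concatenate([embed_sequence(action, 0),
                         embed_sequence(reaction, 1)], axis=0)
print(tokens.shape)  # (16, 16)
```

The key design point reported in the paper is that this explicit identity signal keeps the model from confusing the two agents' skeletons, which is what otherwise leads to structural collapse.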

Abstract

Recent advances in deep learning have enabled the generation of videos from textual descriptions as well as the prediction of future sequences from input videos. Similarly, in human motion modeling, motions can be generated from text or predicted from a single person's motion sequence. However, these approaches primarily focus on single-agent motion generation. In contrast, this study addresses the problem of generating the motion of one person based on the motion of another in interaction scenarios, where the two motions are mutually dependent. We construct a dataset of paired action-reaction motion sequences extracted from boxing match videos and investigate the effectiveness of Transformer-based models for this task. Specifically, we implement and compare three models: a simple Transformer, iTransformer, and Crossformer. In addition, we introduce a person ID embedding to explicitly distinguish between individuals, enabling the model to maintain structural consistency and better capture interaction dynamics. Experimental results show that the simple Transformer can generate plausible interaction-aware motions without suffering from posture collapse, while iTransformer and Crossformer accumulate errors over time, leading to unstable motion generation. Furthermore, the proposed person ID embedding contributes to preventing structural collapse and improving motion consistency. These results highlight the importance of explicitly modeling individual identity in interaction-aware motion generation.