Re$^2$MoGen: Open-Vocabulary Motion Generation via LLM Reasoning and Physics-Aware Refinement

arXiv cs.RO · April 21, 2026

Key Points

  • The paper introduces Re$^2$MoGen, an open-vocabulary text-to-motion generation framework designed to handle cases where motion descriptions differ substantially from training texts.
  • It uses LLM reasoning, enhanced via Monte Carlo tree search, to produce initial motion keyframes from text prompts, specifying only the root and a few key joints to simplify planning.
  • A human pose model then serves as a prior to complete full-body poses from the planned keyframes, and the resulting partial motion supervises fine-tuning of a pre-trained motion generator via a dynamic temporal matching objective, enabling spatiotemporal completion.
  • Finally, the method applies reinforcement learning post-training with physics-aware rewards to refine the motions and remove physically implausible artifacts, achieving state-of-the-art results in open-vocabulary motion generation.
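The dynamic temporal matching objective aligns the sparse planned keyframes with the dense generated motion. The paper's exact formulation is not given here; as an illustration, a DTW-style dynamic-programming loss that matches each keyframe to some frame in temporal order, assuming a `(T, D)` motion array and a `(K, D)` keyframe array, might look like:

```python
import numpy as np

def dtw_keyframe_loss(motion, keyframes):
    """Dynamic-programming alignment cost between a dense generated
    motion (T x D) and K sparse planned keyframes (K x D).
    Every keyframe must be matched to exactly one frame, and the
    matched frames must appear in temporal order (a DTW-style
    objective; the weights and distance are illustrative)."""
    T, K = len(motion), len(keyframes)
    # pairwise squared distances between frames and keyframes: (T, K)
    cost = ((motion[:, None, :] - keyframes[None, :, :]) ** 2).sum(-1)
    dp = np.full((T + 1, K + 1), np.inf)
    dp[:, 0] = 0.0  # zero cost before any keyframe is matched
    for t in range(1, T + 1):
        for k in range(1, K + 1):
            # frame t either matches keyframe k, or is skipped
            dp[t, k] = min(dp[t - 1, k - 1] + cost[t - 1, k - 1],
                           dp[t - 1, k])
    return dp[T, K] / K  # average matching cost per keyframe
```

Because the alignment is computed per sample, such a loss lets the incomplete, keyframe-level plan supervise a full-length motion without fixing which timestep each keyframe lands on.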

Abstract

Text-to-motion (T2M) generation aims to control the behavior of a target character via textual descriptions. Leveraging text-motion paired datasets, existing T2M models have achieved impressive performance in generating high-quality motions within the distribution of their training data. However, their performance deteriorates notably when the motion descriptions differ significantly from the training texts. To address this issue, we propose Re$^2$MoGen, a Reasoning and Refinement open-vocabulary Motion Generation framework that leverages enhanced Large Language Model (LLM) reasoning to generate an initial motion plan and then refines its physical plausibility via reinforcement learning (RL) post-training. Specifically, Re$^2$MoGen consists of three stages. First, we employ Monte Carlo tree search to enhance the LLM's reasoning ability in generating plausible motion keyframes from text prompts, specifying only the root and several key joint positions to ease the reasoning process. Then, we apply a human pose model as a prior to optimize full-body poses based on the planned keyframes and use the resulting incomplete motion to supervise fine-tuning of a pre-trained motion generator via a dynamic temporal matching objective, enabling spatiotemporal completion. Finally, we apply post-training with physics-aware rewards to refine motion quality and eliminate physical implausibility in the LLM-planned motions. Extensive experiments demonstrate that our framework generates semantically consistent and physically plausible motions and achieves state-of-the-art performance in open-vocabulary motion generation.
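The abstract does not spell out the physics-aware rewards. Two penalties commonly used when refining generated motion with RL are ground penetration and foot sliding during contact; a minimal sketch, with hypothetical array shapes, thresholds, and weights (none taken from the paper):

```python
import numpy as np

def physics_reward(foot_pos, contact_thresh=0.05, dt=1.0 / 30.0):
    """Illustrative physics-aware reward for RL post-training.

    foot_pos: (T, num_feet, 3) world-space foot positions in meters,
    with the ground plane assumed at z = 0. Penalizes (a) feet sinking
    below the ground and (b) horizontal foot velocity while a foot is
    near the ground (sliding). The 0.1 weight is an arbitrary choice.
    """
    z = foot_pos[..., 2]
    # (a) ground penetration: depth below z = 0, summed over all frames
    penetration = np.clip(-z, 0.0, None).sum()
    # (b) foot sliding: horizontal speed on frames where the foot is
    # within contact_thresh of the ground
    vel_xy = np.diff(foot_pos[..., :2], axis=0) / dt
    in_contact = z[:-1] < contact_thresh
    sliding = (np.linalg.norm(vel_xy, axis=-1) * in_contact).sum()
    return -(penetration + 0.1 * sliding)
```

A motion with feet resting exactly on the ground and not moving would score 0, while penetrating or sliding feet push the reward negative, giving the RL stage a gradient toward physically plausible contacts.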