Commanding Humanoid by Free-form Language: A Large Language Action Model with Unified Motion Vocabulary

arXiv cs.RO / 4/13/2026


Key Points

  • The paper introduces Humanoid-LLA, a Large Language Action Model that converts free-form natural language into physically executable whole-body actions for humanoid robots.
  • It proposes a unified motion vocabulary that maps human and humanoid motion primitives into a shared discrete space to improve motion diversity while preserving plausibility.
  • A vocabulary-directed controller distilled from a privileged policy is used to maintain physical feasibility of the generated actions.
  • The method includes physics-informed fine-tuning via reinforcement learning with dynamics-aware rewards to improve robustness and stability.
  • Experiments in simulation and on Unitree G1 and Booster T1 humanoids indicate improved language generalization and better motion naturalness, stability, and execution success versus prior language-conditioned controllers.

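The unified motion vocabulary above suggests a vector-quantized tokenization, where continuous motion features from both human and humanoid data are snapped to a shared discrete codebook. The paper's summary gives no implementation details, so the following is only a minimal sketch of nearest-neighbor codebook lookup; the function name, shapes, and codebook values are all illustrative assumptions, not the authors' code:

```python
import numpy as np

def quantize_motion(features, codebook):
    """Map continuous motion features to discrete token ids by
    nearest-neighbor lookup in a shared codebook (hypothetical sketch).

    features: (T, D) array of per-frame motion features
    codebook: (K, D) array of learned code vectors
    returns:  (T,) array of integer token ids
    """
    # Squared Euclidean distance from every frame to every code vector,
    # computed via broadcasting: result has shape (T, K).
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

# Toy example: 3 codes in 2-D; each frame snaps to its nearest code.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
frames = np.array([[0.1, -0.1], [0.9, 1.1]])
tokens = quantize_motion(frames, codebook)
print(tokens.tolist())  # [0, 1]
```

In such a scheme, the language model would predict sequences of these token ids, and the vocabulary-directed controller would decode them into joint-level commands, which is what keeps generation diverse (tokens cover both human and humanoid data) yet physically grounded (every token decodes through the controller).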
Abstract

Enabling humanoid robots to follow free-form language commands is critical for seamless human-robot interaction, collaborative task execution, and general-purpose embodied intelligence. While recent advances have improved low-level humanoid locomotion and robot manipulation, language-conditioned whole-body control remains a significant challenge. Existing methods are often limited to simple instructions and sacrifice either motion diversity or physical plausibility. To address this, we introduce Humanoid-LLA, a Large Language Action Model that maps expressive language commands to physically executable whole-body actions for humanoid robots. Our approach integrates three core components: a unified motion vocabulary that aligns human and humanoid motion primitives into a shared discrete space; a vocabulary-directed controller distilled from a privileged policy to ensure physical feasibility; and a physics-informed fine-tuning stage using reinforcement learning with dynamics-aware rewards to enhance robustness and stability. Extensive evaluations in simulation and on real-world Unitree G1 and Booster T1 humanoids show that Humanoid-LLA delivers strong language generalization while maintaining high physical fidelity, outperforming existing language-conditioned controllers in motion naturalness, stability, and execution success rate.
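The physics-informed fine-tuning stage combines task rewards with dynamics-aware terms to promote robustness and stability. The abstract does not specify the reward, so the sketch below is a hypothetical shaping that penalizes torso tilt and joint-torque effort; the weights, names, and penalty forms are assumptions for illustration only:

```python
def dynamics_aware_reward(task_reward, torso_tilt_rad, joint_torques,
                          w_tilt=0.5, w_torque=0.01):
    """Shape a language-following task reward with stability and
    effort penalties (hypothetical example, not the paper's reward).

    task_reward:    scalar reward for executing the commanded motion
    torso_tilt_rad: torso deviation from upright, in radians
    joint_torques:  per-joint torque magnitudes (N*m)
    """
    tilt_penalty = w_tilt * torso_tilt_rad ** 2                    # stay upright
    effort_penalty = w_torque * sum(t * t for t in joint_torques)  # move smoothly
    return task_reward - tilt_penalty - effort_penalty

# An upright, low-effort execution scores higher than a tilted,
# high-torque one for the same task reward.
r_stable = dynamics_aware_reward(1.0, 0.05, [1.0, 2.0])
r_unstable = dynamics_aware_reward(1.0, 0.6, [8.0, 9.0])
print(r_stable > r_unstable)  # True
```

Fine-tuning the policy against rewards of this shape is one plausible way a controller learns to trade off faithfulness to the generated motion tokens against physical feasibility on hardware like the Unitree G1 and Booster T1.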