Learning Multi-Modal Whole-Body Control for Real-World Humanoid Robots

arXiv cs.RO · April 23, 2026

Key Points

  • The paper introduces the Masked Humanoid Controller (MHC), a learned whole-body control method for humanoid robots that uses masked target trajectories over selected robot state variables as a unified command interface.
  • The approach lets high-level systems specify diverse behaviors—such as footstep plans, partial-body mimicry, motion-capture-driven motions, video retargeting, and joystick teleoperation—in a flexible format.
  • MHC is trained in simulation with a curriculum spanning multiple input modalities, aiming to robustly execute partially specified behaviors while preserving balance and disturbance rejection.
  • The authors evaluate MHC in both simulation and on a real-world Digit V3 humanoid, finding that one controller can handle multiple command types via the same representation.
  • Overall, the work targets the longstanding challenge of providing a single interface that can command many whole-body behaviors without redesigning controllers for each modality.
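To make the "masked target trajectory" idea concrete, here is a minimal sketch of what such a command interface could look like. All names, the state layout, and the observation format below are illustrative assumptions, not details from the paper: the core idea is simply that a command pairs a target vector with a boolean mask marking which state variables are actually commanded, leaving the controller free to choose the rest.

```python
import numpy as np

# Hypothetical state layout (assumption, not from the paper):
# [torso_vx, torso_vy, yaw_rate, left_hand_z, right_hand_z, gait_phase]
STATE_DIM = 6

def make_masked_command(targets, commanded):
    """Build a masked command: entries not in `commanded` are zeroed
    and flagged as unspecified, so the policy may fill them in freely."""
    targets = np.asarray(targets, dtype=float)
    mask = np.zeros(STATE_DIM, dtype=bool)
    mask[list(commanded)] = True
    return np.where(mask, targets, 0.0), mask

def controller_observation(proprioception, targets, mask):
    """Concatenate proprioception, masked targets, and the mask itself,
    as a learned whole-body policy might consume them each step."""
    return np.concatenate([proprioception, targets, mask.astype(float)])

# Joystick-style command: only torso velocity (indices 0 and 1) is
# specified; arm heights and gait phase are left to the controller.
tgt, msk = make_masked_command([0.5, 0.0, 0, 0, 0, 0], commanded=[0, 1])
obs = controller_observation(np.zeros(STATE_DIM), tgt, msk)
```

Under this abstraction, a footstep plan, a motion-capture clip, or a joystick signal all reduce to the same (targets, mask) pair, differing only in which entries the mask turns on.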

Abstract

A major challenge in humanoid robotics is designing a unified interface for commanding diverse whole-body behaviors, from precise footstep sequences to partial-body mimicry and joystick teleoperation. We introduce the Masked Humanoid Controller (MHC), a learned whole-body controller that exposes a simple yet expressive interface: the specification of masked target trajectories over selected subsets of the robot's state variables. This unified abstraction allows high-level systems to issue commands in a flexible format that accommodates multi-modal inputs such as optimized trajectories, motion capture clips, re-targeted video, and real-time joystick signals. The MHC is trained in simulation using a curriculum that spans this full range of modalities, enabling robust execution of partially specified behaviors while maintaining balance and disturbance rejection. We demonstrate the MHC both in simulation and on the real-world Digit V3 humanoid, showing that a single learned controller is capable of executing such diverse whole-body commands in the real world through a common representational interface.