Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation
arXiv cs.RO / 4/28/2026
Key Points
- The paper proposes MoT-HRA, a hierarchical vision-language-action framework that learns embodiment-agnostic human-intention priors for robotic manipulation from large-scale human demonstrations.
- It introduces HA-2.2M, a newly curated 2.2M-episode action-language dataset built from heterogeneous human videos via hand-centric filtering, spatial reconstruction, temporal segmentation, and language alignment (a pipeline sketch follows this list).
- MoT-HRA decomposes manipulation into three coupled experts: a vision-language expert predicts 3D trajectories, an intention expert learns MANO-style latent hand-motion priors, and a fine expert converts intention-aware representations into robot action chunks.
- A shared-attention trunk with read-only key-value transfer is designed to let downstream controllers leverage the human priors while reducing interference with upstream representations (see the attention sketch after this list).
- Experiments across hand motion generation, simulated manipulation, and real-world robot tasks indicate improved motion plausibility and more robust control, especially under distribution shifts.
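
To make the HA-2.2M curation steps concrete, here is a minimal Python sketch of a four-stage loop mirroring the description above (hand-centric filtering, spatial reconstruction, temporal segmentation, language alignment). It is not the paper's released pipeline; every helper, field, and threshold (e.g. `Clip`, `min_hand_visibility`) is an assumption for illustration.

```python
# Hedged sketch: an assumed four-stage curation loop, not the paper's code.
from dataclasses import dataclass, field


@dataclass
class Clip:
    frames: list             # raw video frames (placeholders here)
    hand_visibility: float   # fraction of frames with a detected hand
    caption: str = ""
    hand_poses_3d: list = field(default_factory=list)
    segments: list = field(default_factory=list)


def curate(clips, min_hand_visibility=0.8):
    episodes = []
    for clip in clips:
        # 1) Hand-centric filtering: keep clips where hands are clearly visible.
        if clip.hand_visibility < min_hand_visibility:
            continue
        # 2) Spatial reconstruction: lift 2D hand detections to 3D poses
        #    (stubbed; a real pipeline would run a hand-pose estimator).
        clip.hand_poses_3d = [None] * len(clip.frames)
        # 3) Temporal segmentation: split the clip into atomic action segments
        #    (stubbed as a single segment spanning the clip).
        clip.segments = [(0, len(clip.frames))]
        # 4) Language alignment: attach an instruction to each segment.
        for start, end in clip.segments:
            episodes.append({"frames": clip.frames[start:end],
                             "poses": clip.hand_poses_3d[start:end],
                             "instruction": clip.caption or "manipulate object"})
    return episodes


print(len(curate([Clip(frames=list(range(30)), hand_visibility=0.9,
                       caption="open the drawer")])))  # -> 1 episode
```

The read-only key-value transfer can be pictured as downstream experts attending to trunk features whose gradients are blocked, so controller training cannot perturb the upstream representation. The PyTorch sketch below is an assumed illustration, not the paper's implementation; the class name, projections, and dimensions are made up.

```python
# Hedged sketch: "read-only" key-value transfer via a stop-gradient on the
# shared trunk's tokens. All names and shapes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReadOnlyKVAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)  # downstream expert queries
        self.k_proj = nn.Linear(dim, dim)  # keys from the shared trunk
        self.v_proj = nn.Linear(dim, dim)  # values from the shared trunk
        self.out = nn.Linear(dim, dim)

    def forward(self, expert_tokens, trunk_tokens):
        # Detach trunk features: the expert reads them, but no gradient flows back.
        trunk_ro = trunk_tokens.detach()
        q = self.q_proj(expert_tokens)
        k = self.k_proj(trunk_ro)
        v = self.v_proj(trunk_ro)
        attn = F.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        return self.out(attn @ v)


# Toy forward/backward pass: 8 trunk tokens read by a 4-token action expert.
trunk = torch.randn(1, 8, 256, requires_grad=True)
expert = torch.randn(1, 4, 256)
out = ReadOnlyKVAttention(256)(expert, trunk)
out.sum().backward()
print(trunk.grad)  # None: no gradient reaches the trunk through the read-only path
```

Blocking the gradient is one simple way to realize "read-only" sharing; the paper's exact mechanism may differ, for example by passing the trunk's precomputed keys and values directly to the downstream experts.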


