Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation

arXiv cs.RO / 4/28/2026


Key Points

  • The paper proposes MoT-HRA, a hierarchical vision-language-action framework that learns embodiment-agnostic human-intention priors for robotic manipulation from large-scale human demonstrations.
  • It introduces HA-2.2M, a newly curated 2.2M-episode action-language dataset built from heterogeneous human videos using hand-centric filtering, spatial reconstruction, temporal segmentation, and language alignment.
  • MoT-HRA decomposes manipulation into three coupled experts: a vision-language expert that predicts embodiment-agnostic 3D trajectories, an intention expert that learns MANO-style latent hand-motion priors, and a fine expert that converts intention-aware representations into robot action chunks.
  • A shared-attention trunk with read-only key-value transfer lets downstream controllers leverage human priors while limiting interference with upstream representations (see the sketch after this list).
  • Experiments across hand motion generation, simulated manipulation, and real-world robot tasks indicate improved motion plausibility and more robust control, especially under distribution shifts.
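The three-expert split and the read-only key-value transfer can be pictured with a small PyTorch sketch. Everything below is illustrative: the module choices (plain transformer encoders for the experts, learned action queries with cross-attention for the fine expert), the dimensions, and the use of `detach()` as the "read-only" mechanism are assumptions for exposition, not the paper's implementation.

```python
# Toy sketch of MoT-HRA's three-expert structure with read-only K/V transfer.
# All names, sizes, and the detach()-based gating are assumptions, not the paper's code.
import torch
import torch.nn as nn


class ReadOnlyCrossAttention(nn.Module):
    """Cross-attention where the query stream cannot update the upstream K/V."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, queries: torch.Tensor, upstream_tokens: torch.Tensor) -> torch.Tensor:
        # Detaching the upstream tokens makes the transfer "read-only": the fine
        # expert can attend to human-intention features, but its gradients do not
        # flow back into (and interfere with) the upstream representation.
        kv = upstream_tokens.detach()
        out, _ = self.attn(queries, kv, kv)
        return queries + out


class MoTHRASketch(nn.Module):
    def __init__(self, dim: int = 512, action_dim: int = 7, chunk_len: int = 16):
        super().__init__()
        # Vision-language expert: predicts an embodiment-agnostic 3D trajectory.
        self.vl_expert = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=4
        )
        self.traj_head = nn.Linear(dim, 3)  # per-token 3D waypoint

        # Intention expert: models MANO-style hand motion as a latent human-motion prior.
        self.intention_expert = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=4
        )

        # Fine expert: maps intention-aware features to robot action chunks,
        # reading upstream tokens only through read-only cross-attention.
        self.read_vl = ReadOnlyCrossAttention(dim)
        self.read_intention = ReadOnlyCrossAttention(dim)
        self.action_queries = nn.Parameter(torch.randn(chunk_len, dim))
        self.action_head = nn.Linear(dim, action_dim)

    def forward(self, obs_tokens: torch.Tensor):
        vl_tokens = self.vl_expert(obs_tokens)              # (B, T, D)
        trajectory = self.traj_head(vl_tokens)               # (B, T, 3)
        intent_tokens = self.intention_expert(vl_tokens)     # (B, T, D)

        b = obs_tokens.size(0)
        queries = self.action_queries.unsqueeze(0).expand(b, -1, -1)  # (B, chunk, D)
        queries = self.read_vl(queries, vl_tokens)
        queries = self.read_intention(queries, intent_tokens)
        actions = self.action_head(queries)                   # (B, chunk, action_dim)
        return trajectory, actions
```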

Abstract

Human videos contain rich manipulation priors, but using them for robot learning remains difficult because raw observations entangle scene understanding, human motion, and embodiment-specific action. We introduce MoT-HRA, a hierarchical vision-language-action framework that learns human-intention priors from large-scale human demonstrations. We first curate HA-2.2M, a 2.2M-episode action-language dataset reconstructed from heterogeneous human videos through hand-centric filtering, spatial reconstruction, temporal segmentation, and language alignment. On top of this dataset, MoT-HRA factorizes manipulation into three coupled experts: a vision-language expert predicts an embodiment-agnostic 3D trajectory, an intention expert models MANO-style hand motion as a latent human-motion prior, and a fine expert maps the intention-aware representation to robot action chunks. A shared-attention trunk and read-only key-value transfer allow downstream control to use human priors while limiting interference with upstream representations. Experiments on hand motion generation, simulated manipulation, and real-world robot tasks show that MoT-HRA improves motion plausibility and robust control under distribution shift.
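To make the four HA-2.2M curation stages named above concrete, here is a toy Python sketch of how raw human-video clips could flow through hand-centric filtering, spatial reconstruction, temporal segmentation, and language alignment to yield action-language episodes. The `Episode` fields, function signatures, and placeholder bodies are assumptions for illustration; the paper's actual pipeline is not specified at this level of detail.

```python
# Placeholder sketch of the four curation stages behind HA-2.2M.
# Every field and function body is an assumption, not the authors' pipeline.
from dataclasses import dataclass, field


@dataclass
class Episode:
    """One action-language episode reconstructed from a human video clip."""
    frames: list                                     # raw RGB frames of the clip
    hand_poses: list = field(default_factory=list)   # per-frame 3D hand poses (e.g., MANO parameters)
    segments: list = field(default_factory=list)     # (start, end) frame indices of atomic manipulations
    instruction: str = ""                            # aligned language description


def hand_centric_filtering(clips: list) -> list:
    """Keep only clips in which a manipulating hand is reliably visible."""
    return [clip for clip in clips if clip.get("hand_detected", False)]


def spatial_reconstruction(clip: dict) -> Episode:
    """Lift 2D hand observations to 3D poses and trajectories (placeholder)."""
    return Episode(frames=clip["frames"], hand_poses=clip.get("poses_3d", []))


def temporal_segmentation(episode: Episode) -> Episode:
    """Split the episode into short, temporally coherent manipulation segments."""
    episode.segments = [(0, len(episode.frames))]  # trivial single segment as a stand-in
    return episode


def language_alignment(episode: Episode, caption: str) -> Episode:
    """Attach a language instruction describing the segmented manipulation."""
    episode.instruction = caption
    return episode


def curate(raw_clips: list) -> list:
    """Run the four stages in order to build action-language episodes."""
    episodes = []
    for clip in hand_centric_filtering(raw_clips):
        episode = spatial_reconstruction(clip)
        episode = temporal_segmentation(episode)
        episode = language_alignment(episode, clip.get("caption", ""))
        episodes.append(episode)
    return episodes
```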