UniDex: A Robot Foundation Suite for Universal Dexterous Hand Control from Egocentric Human Videos

arXiv cs.RO · March 24, 2026


Key Points

  • UniDex proposes a “robot foundation suite” for universal dexterous hand control by combining a large robot-centric dataset, a unified vision-language-action (VLA) policy, and a human-data capture setup.
  • The approach builds UniDex-Dataset with 50K+ robot trajectories across eight different dexterous hands (6–24 DoFs) by retargeting egocentric human videos via a human-in-the-loop procedure that preserves plausible hand-object contacts.
  • UniDex introduces FAAS (Function-Actuator-Aligned Space), an action-space mapping that aligns functionally similar actuators across different hand embodiments to enable cross-hand transfer.
  • The UniDex-VLA policy is pretrained on the robot-centric dataset and then fine-tuned with task demonstrations, achieving strong performance on tool-use tasks (81% average task progress) and notable zero-shot cross-hand generalization.
  • UniDex-Cap provides a portable RGB-D + hand-pose capture pipeline to convert human data into robot-executable trajectories, aiming to reduce dependence on expensive robot teleoperation for co-training.
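The FAAS idea in the points above can be illustrated as a lookup that projects each hand's actuator indices into a shared, function-indexed action vector. The slot names, hand names, and mappings below are hypothetical placeholders, not taken from the paper; a minimal sketch:

```python
# Hedged sketch of a function-aligned action space: actuators from hands with
# different DoF counts are mapped to shared (finger, joint-role) slots so a
# single policy can emit one action vector for many embodiments.
# SHARED_SLOTS and HAND_MAPPINGS are illustrative assumptions, not the paper's.
import numpy as np

SHARED_SLOTS = [
    ("thumb", "flex"), ("thumb", "abduct"),
    ("index", "flex"), ("middle", "flex"),
]
SLOT_INDEX = {slot: i for i, slot in enumerate(SHARED_SLOTS)}

# Per-hand mapping: actuator index -> shared functional slot (hypothetical).
HAND_MAPPINGS = {
    "hand_6dof":  {0: ("thumb", "flex"), 1: ("index", "flex"),
                   2: ("middle", "flex")},
    "hand_16dof": {0: ("thumb", "flex"), 1: ("thumb", "abduct"),
                   4: ("index", "flex"), 8: ("middle", "flex")},
}

def to_shared(hand, actions):
    """Project a hand-specific action vector into the shared space.
    Slots the hand lacks stay at 0 (masking would be an alternative)."""
    vec = np.zeros(len(SHARED_SLOTS))
    for act_idx, slot in HAND_MAPPINGS[hand].items():
        vec[SLOT_INDEX[slot]] = actions[act_idx]
    return vec

def from_shared(hand, shared_vec):
    """Decode shared-space actions back onto a specific hand's actuators."""
    actions = np.zeros(max(HAND_MAPPINGS[hand]) + 1)
    for act_idx, slot in HAND_MAPPINGS[hand].items():
        actions[act_idx] = shared_vec[SLOT_INDEX[slot]]
    return actions
```

Because both hands read and write the same shared coordinates, an action produced for one embodiment can be decoded on another, which is the mechanism that makes cross-hand transfer possible in this parameterization.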

Abstract

Dexterous manipulation remains challenging due to the cost of collecting real-robot teleoperation data, the heterogeneity of hand embodiments, and the high dimensionality of control. We present UniDex, a robot foundation suite that couples a large-scale robot-centric dataset with a unified vision-language-action (VLA) policy and a practical human-data capture setup for universal dexterous hand control. First, we construct UniDex-Dataset, a robot-centric dataset with over 50K trajectories across eight dexterous hands (6--24 DoFs), derived from egocentric human video datasets. To transform human data into robot-executable trajectories, we employ a human-in-the-loop retargeting procedure that aligns fingertip trajectories while preserving plausible hand-object contacts, and we operate on explicit 3D point clouds with human hands masked to narrow kinematic and visual gaps. Second, we introduce the Function-Actuator-Aligned Space (FAAS), a unified action space that maps functionally similar actuators to shared coordinates, enabling cross-hand transfer. Leveraging FAAS as the action parameterization, we train UniDex-VLA, a 3D VLA policy pretrained on UniDex-Dataset and fine-tuned with task demonstrations. In addition, we build UniDex-Cap, a simple portable capture setup that records synchronized RGB-D streams and human hand poses and converts them into robot-executable trajectories, enabling human-robot data co-training that reduces reliance on costly robot demonstrations. On challenging tool-use tasks across two different hands, UniDex-VLA achieves 81% average task progress and outperforms prior VLA baselines by a large margin, while exhibiting strong spatial, object, and zero-shot cross-hand generalization. Together, UniDex-Dataset, UniDex-VLA, and UniDex-Cap provide a scalable foundation suite for universal dexterous manipulation.
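The fingertip-alignment step of retargeting can be viewed as an optimization: find joint angles whose forward kinematics track a human fingertip position. The toy 2-link planar finger, link lengths, and finite-difference gradient descent below are illustrative assumptions only; the paper's actual pipeline is human-in-the-loop and additionally preserves hand-object contacts, which this sketch omits:

```python
# Hedged sketch: fingertip retargeting as least-squares tracking of a target
# fingertip position with a robot finger's forward kinematics (FK).
# The 2-joint planar finger and all constants here are hypothetical.
import numpy as np

L1, L2 = 0.04, 0.03  # link lengths in meters (assumed)

def fingertip(q):
    """FK of a 2-joint planar finger: base joint q[0], distal joint q[1]."""
    x = L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1])
    y = L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1])
    return np.array([x, y])

def retarget(target, q0=None, iters=1000, lr=50.0, eps=1e-6):
    """Minimize squared fingertip error by gradient descent, treating FK as a
    black box (finite-difference gradient)."""
    q = np.zeros(2) if q0 is None else q0.astype(float).copy()
    for _ in range(iters):
        f0 = np.sum((fingertip(q) - target) ** 2)
        grad = np.zeros(2)
        for i in range(2):
            dq = np.zeros(2)
            dq[i] = eps
            grad[i] = (np.sum((fingertip(q + dq) - target) ** 2) - f0) / eps
        q -= lr * grad
    return q
```

In a full retargeting system this objective would be summed over all fingertips and time steps, with additional terms penalizing contact violations, which is what distinguishes contact-preserving retargeting from plain inverse kinematics.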