UMI-3D: Extending Universal Manipulation Interface from Vision-Limited to 3D Spatial Perception

arXiv cs.RO / April 16, 2026


Key Points

  • UMI-3D is a multimodal extension of the Universal Manipulation Interface (UMI) that moves embodied-manipulation data collection beyond the limitations of UMI’s monocular, vision-only SLAM.
  • By integrating a lightweight, low-cost wrist-mounted LiDAR sensor and using LiDAR-centric SLAM, UMI-3D provides robust, accurate metric-scale pose estimation in occluded, dynamic, and tracking-failure scenarios.
  • The work introduces a hardware-synchronized multimodal sensing pipeline and a unified spatiotemporal calibration framework that aligns visual observations with LiDAR point clouds into consistent 3D representations of demonstrations (see the sketch after this list).
  • UMI-3D keeps the original 2D visuomotor policy formulation but delivers higher-quality, more reliable data that translates into improved policy performance, enabling the learning of tasks that are difficult or infeasible for vision-only UMI, such as deformable and articulated object manipulation.
  • The system includes an end-to-end workflow for acquisition, alignment, training, and deployment, and releases hardware and software as open source to support large-scale embodied intelligence research.
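
For intuition, here is a minimal Python sketch of the spatial half of such a camera-LiDAR alignment: projecting LiDAR points into the wrist camera image given an extrinsic transform and pinhole intrinsics. The function name and the parameters `T_cam_lidar` and `K` are illustrative assumptions, not the paper's actual API; the released calibration framework estimates these quantities rather than taking them as given.

```python
import numpy as np

def project_lidar_to_image(points_lidar, T_cam_lidar, K):
    """Project LiDAR points into the camera image (illustrative sketch).

    points_lidar: (N, 3) points in the LiDAR frame.
    T_cam_lidar:  (4, 4) extrinsic transform, LiDAR frame -> camera frame.
    K:            (3, 3) pinhole camera intrinsics.
    Returns (M, 2) pixel coordinates for points in front of the camera.
    """
    # Homogenize the points and move them into the camera frame.
    pts_h = np.hstack([points_lidar, np.ones((points_lidar.shape[0], 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]

    # Keep only points with positive depth (in front of the camera).
    pts_cam = pts_cam[pts_cam[:, 2] > 0]

    # Perspective projection: apply intrinsics, then divide by depth.
    uv = (K @ pts_cam.T).T
    return uv[:, :2] / uv[:, 2:3]
```

Once extrinsics and intrinsics are calibrated, a projection like this is what lets each camera pixel be associated with metric-scale LiDAR geometry, yielding the consistent 3D demonstration representations described above.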

Abstract

We present UMI-3D, a multimodal extension of the Universal Manipulation Interface (UMI) for robust and scalable data collection in embodied manipulation. While UMI enables portable, wrist-mounted data acquisition, its reliance on monocular visual SLAM makes it vulnerable to occlusions, dynamic scenes, and tracking failures, limiting its applicability in real-world environments. UMI-3D addresses these limitations by introducing a lightweight and low-cost LiDAR sensor tightly integrated into the wrist-mounted interface, enabling LiDAR-centric SLAM with accurate metric-scale pose estimation under challenging conditions. We further develop a hardware-synchronized multimodal sensing pipeline and a unified spatiotemporal calibration framework that aligns visual observations with LiDAR point clouds, producing consistent 3D representations of demonstrations. Despite maintaining the original 2D visuomotor policy formulation, UMI-3D significantly improves the quality and reliability of collected data, which directly translates into enhanced policy performance. Extensive real-world experiments demonstrate that UMI-3D not only achieves high success rates on standard manipulation tasks, but also enables learning of tasks that are challenging or infeasible for the original vision-only UMI setup, including large deformable object manipulation and articulated object operation. The system supports an end-to-end pipeline for data acquisition, alignment, training, and deployment, while preserving the portability and accessibility of the original UMI. All hardware and software components are open-sourced to facilitate large-scale data collection and accelerate research in embodied intelligence: https://umi-3d.github.io.
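
As one concrete illustration of what hardware synchronization buys at the data-processing level, the sketch below pairs each camera frame with the nearest LiDAR sweep after correcting a calibrated clock offset. All names here (`pair_frames`, `t_offset`, `max_gap`) are hypothetical; UMI-3D performs synchronization in hardware, and this is only a software-side analogue of the resulting timestamp alignment, not the paper's pipeline.

```python
import numpy as np

def pair_frames(cam_stamps, lidar_stamps, t_offset=0.0, max_gap=0.02):
    """Pair each camera frame with the nearest LiDAR sweep (illustrative).

    cam_stamps:   sorted array of camera timestamps (seconds).
    lidar_stamps: sorted array of LiDAR sweep timestamps (seconds).
    t_offset:     calibrated clock offset (LiDAR clock minus camera clock).
    max_gap:      reject pairs whose timestamps differ by more than this.
    Returns a list of (camera_index, lidar_index) pairs.
    """
    # Express camera times in the LiDAR clock before matching.
    corrected = np.asarray(cam_stamps) + t_offset
    idx = np.searchsorted(lidar_stamps, corrected)

    pairs = []
    for i, j in enumerate(idx):
        # Candidates: the sweep just before and just after the frame time.
        cands = [k for k in (j - 1, j) if 0 <= k < len(lidar_stamps)]
        best = min(cands, key=lambda k: abs(lidar_stamps[k] - corrected[i]))
        if abs(lidar_stamps[best] - corrected[i]) <= max_gap:
            pairs.append((i, best))
    return pairs
```

With a hardware-synchronized rig, the residual offset and matching gaps are small and stable, which is precisely why the aligned camera-LiDAR pairs yield more reliable training data than timestamps matched after the fact.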