UMI-3D: Extending Universal Manipulation Interface from Vision-Limited to 3D Spatial Perception

arXiv cs.RO / April 16, 2026


Key Points

  • UMI-3D is a multimodal extension of the Universal Manipulation Interface (UMI) that moves embodied-manipulation data collection beyond the limitations of UMI’s monocular, vision-only SLAM.
  • By integrating a lightweight, low-cost wrist-mounted LiDAR sensor and using LiDAR-centric SLAM, UMI-3D provides robust, accurate metric-scale pose estimation in occluded, dynamic, and tracking-failure scenarios.
  • The work introduces a hardware-synchronized multimodal sensing pipeline and a unified spatiotemporal calibration framework that aligns visual observations with LiDAR point clouds into consistent 3D representations of demonstrations (see the sketch after this list).
  • UMI-3D keeps the original 2D visuomotor policy formulation but delivers higher-quality, more reliable data that translates into improved policy performance, enabling the learning of tasks that are difficult or infeasible for vision-only UMI, such as deformable and articulated object manipulation.
  • The system includes an end-to-end workflow for acquisition, alignment, training, and deployment, and releases hardware and software as open source to support large-scale embodied intelligence research.
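
For intuition, here is a minimal Python sketch of the spatial half of such a camera-LiDAR alignment: projecting LiDAR points into the wrist camera image given an extrinsic transform and pinhole intrinsics. The function name and the parameters `T_cam_lidar` and `K` are illustrative assumptions, not the paper's actual API; the released calibration framework estimates these quantities rather than taking them as given.

```python
import numpy as np

def project_lidar_to_image(points_lidar, T_cam_lidar, K):
    """Project LiDAR points into the camera image (illustrative sketch).

    points_lidar: (N, 3) points in the LiDAR frame.
    T_cam_lidar:  (4, 4) extrinsic transform, LiDAR frame -> camera frame.
    K:            (3, 3) pinhole camera intrinsics.
    Returns (M, 2) pixel coordinates for points in front of the camera.
    """
    # Homogenize the points and move them into the camera frame.
    pts_h = np.hstack([points_lidar, np.ones((points_lidar.shape[0], 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]

    # Keep only points with positive depth (in front of the camera).
    pts_cam = pts_cam[pts_cam[:, 2] > 0]

    # Perspective projection: apply intrinsics, then divide by depth.
    uv = (K @ pts_cam.T).T
    return uv[:, :2] / uv[:, 2:3]
```

Once extrinsics and intrinsics are calibrated, a projection like this is what lets each camera pixel be associated with metric-scale LiDAR geometry, yielding the consistent 3D demonstration representations described above.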

Abstract

We present UMI-3D, a multimodal extension of the Universal Manipulation Interface (UMI) for robust and scalable data collection in embodied manipulation. While UMI enables portable, wrist-mounted data acquisition, its reliance on monocular visual SLAM makes it vulnerable to occlusions, dynamic scenes, and tracking failures, limiting its applicability in real-world environments. UMI-3D addresses these limitations by introducing a lightweight and low-cost LiDAR sensor tightly integrated into the wrist-mounted interface, enabling LiDAR-centric SLAM with accurate metric-scale pose estimation under challenging conditions. We further develop a hardware-synchronized multimodal sensing pipeline and a unified spatiotemporal calibration framework that aligns visual observations with LiDAR point clouds, producing consistent 3D representations of demonstrations. Despite maintaining the original 2D visuomotor policy formulation, UMI-3D significantly improves the quality and reliability of collected data, which directly translates into enhanced policy performance. Extensive real-world experiments demonstrate that UMI-3D not only achieves high success rates on standard manipulation tasks, but also enables learning of tasks that are challenging or infeasible for the original vision-only UMI setup, including large deformable object manipulation and articulated object operation. The system supports an end-to-end pipeline for data acquisition, alignment, training, and deployment, while preserving the portability and accessibility of the original UMI. All hardware and software components are open-sourced to facilitate large-scale data collection and accelerate research in embodied intelligence: https://umi-3d.github.io.
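
As one concrete illustration of what hardware synchronization buys at the data-processing level, the sketch below pairs each camera frame with the nearest LiDAR sweep after correcting a calibrated clock offset. All names here (`pair_frames`, `t_offset`, `max_gap`) are hypothetical; UMI-3D performs synchronization in hardware, and this is only a software-side analogue of the resulting timestamp alignment, not the paper's pipeline.

```python
import numpy as np

def pair_frames(cam_stamps, lidar_stamps, t_offset=0.0, max_gap=0.02):
    """Pair each camera frame with the nearest LiDAR sweep (illustrative).

    cam_stamps:   sorted array of camera timestamps (seconds).
    lidar_stamps: sorted array of LiDAR sweep timestamps (seconds).
    t_offset:     calibrated clock offset (LiDAR clock minus camera clock).
    max_gap:      reject pairs whose timestamps differ by more than this.
    Returns a list of (camera_index, lidar_index) pairs.
    """
    # Express camera times in the LiDAR clock before matching.
    corrected = np.asarray(cam_stamps) + t_offset
    idx = np.searchsorted(lidar_stamps, corrected)

    pairs = []
    for i, j in enumerate(idx):
        # Candidates: the sweep just before and just after the frame time.
        cands = [k for k in (j - 1, j) if 0 <= k < len(lidar_stamps)]
        best = min(cands, key=lambda k: abs(lidar_stamps[k] - corrected[i]))
        if abs(lidar_stamps[best] - corrected[i]) <= max_gap:
            pairs.append((i, best))
    return pairs
```

With a hardware-synchronized rig, the residual offset and matching gaps are small and stable, which is precisely why the aligned camera-LiDAR pairs yield more reliable training data than timestamps matched after the fact.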