OmniUMI: Towards Physically Grounded Robot Learning via Human-Aligned Multimodal Interaction

arXiv cs.RO / 4/14/2026

Key Points

  • OmniUMI proposes a unified framework that synchronously collects the contact-dynamics signals robot learning needs (tactile sensing, grasping force, and external contact wrench) across multiple modalities, compensating for the limits of conventional RGB-centric approaches.
  • A compact handheld device captures tactile sensing, internal grasping force, and external interaction wrench alongside RGB, depth, and trajectory, and a shared embodiment design keeps data collection consistent with real deployment (a data-alignment sketch follows this list).
  • To support human-aligned demonstration capture, it provides bidirectional force feedback across both gripper sides and an external-wrench representation grounded in the handheld device's natural perception.
  • Building on OmniUMI, the authors extend diffusion policy to learn from visual, tactile, and force-related observations and use impedance control to jointly regulate motion and contact behavior, validating the approach on pick-and-place, surface erasing, and tactile-informed selective release (a control-law sketch follows the abstract).
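
The collection side hinges on keeping six sensor streams on a common clock. Below is a minimal sketch of what a synchronized sample and a nearest-timestamp alignment step could look like, assuming each stream arrives as time-sorted (timestamp, value) pairs; the field names, shapes, and resampling strategy are illustrative assumptions, not OmniUMI's actual pipeline.

```python
import bisect
from dataclasses import dataclass

import numpy as np


@dataclass
class MultimodalSample:
    """One time-aligned OmniUMI-style record (illustrative fields/shapes)."""
    t: float                 # reference timestamp (s), taken from the RGB stream
    rgb: np.ndarray          # (H, W, 3) color image
    depth: np.ndarray        # (H, W) depth map
    pose: np.ndarray         # (7,) gripper position + orientation quaternion
    tactile: np.ndarray      # (N,) tactile array reading
    grip_force: float        # internal grasping force (N)
    ext_wrench: np.ndarray   # (6,) external interaction wrench [F; tau]


def nearest(stamps, t):
    """Index of the entry in the sorted list `stamps` closest to time t."""
    i = bisect.bisect_left(stamps, t)
    if i == 0:
        return 0
    if i == len(stamps):
        return len(stamps) - 1
    return i if stamps[i] - t < t - stamps[i - 1] else i - 1


def align(rgb_stream, depth_stream, pose_stream, tactile_stream,
          force_stream, wrench_stream):
    """Resample every modality onto the RGB clock by nearest timestamp.

    Each *_stream is a time-sorted list of (timestamp, value) pairs.
    """
    others = [depth_stream, pose_stream, tactile_stream,
              force_stream, wrench_stream]
    stamp_lists = [[ts for ts, _ in s] for s in others]
    samples = []
    for t, rgb in rgb_stream:
        vals = [s[nearest(stamps, t)][1]
                for s, stamps in zip(others, stamp_lists)]
        samples.append(MultimodalSample(t, rgb, *vals))
    return samples
```

Resampling onto the RGB clock is only one plausible choice; hardware-triggered capture or interpolation would serve the same goal of producing per-frame multimodal tuples.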

Abstract

UMI-style interfaces enable scalable robot learning, but existing systems remain largely visuomotor, relying primarily on RGB observations and trajectories while providing only limited access to physical interaction signals. This becomes a fundamental limitation in contact-rich manipulation, where success depends on contact dynamics such as tactile interaction, internal grasping force, and external interaction wrench that are difficult to infer from vision alone. We present OmniUMI, a unified framework for physically grounded robot learning via human-aligned multimodal interaction. OmniUMI synchronously captures RGB, depth, trajectory, tactile sensing, internal grasping force, and external interaction wrench within a compact handheld system, while maintaining collection-deployment consistency through a shared embodiment design. To support human-aligned demonstration, OmniUMI provides dual force feedback: bilateral gripper feedback together with natural perception of the external interaction wrench in the handheld embodiment. Built on this interface, we extend diffusion policy with visual, tactile, and force-related observations, and deploy the learned policy through impedance-based execution for unified regulation of motion and contact behavior. Experiments demonstrate reliable sensing and strong downstream performance on force-sensitive pick-and-place, interactive surface erasing, and tactile-informed selective release. Overall, OmniUMI combines physically grounded multimodal data acquisition with human-aligned interaction, providing a scalable foundation for learning contact-rich manipulation.
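
On the deployment side, the abstract describes impedance-based execution that regulates motion and contact in one loop. A common form of such a controller is the Cartesian impedance law F_cmd = K(x_d - x) + D(xdot_d - xdot); the sketch below illustrates it with assumed diagonal gains and a hypothetical 6D pose-error convention, since the paper's exact controller and parameters are not given here.

```python
import numpy as np

# Illustrative diagonal gains; real values depend on the robot and the task.
K = np.diag([300.0, 300.0, 300.0, 30.0, 30.0, 30.0])  # stiffness (N/m, Nm/rad)
D = np.diag([30.0, 30.0, 30.0, 3.0, 3.0, 3.0])        # damping


def impedance_wrench(x_d, x, xdot_d, xdot):
    """Cartesian impedance law: commanded wrench from pose/velocity errors.

    x_d, x       : (6,) desired / measured pose, with orientation expressed
                   as a 3D rotation residual (e.g. axis-angle) so elementwise
                   subtraction is meaningful; a simplifying assumption here
    xdot_d, xdot : (6,) desired / measured spatial velocity
    Returns a (6,) wrench [Fx, Fy, Fz, Tx, Ty, Tz]; on a torque-controlled
    arm this would be mapped to joint torques via tau = J.T @ wrench.
    """
    return K @ (x_d - x) + D @ (xdot_d - xdot)
```

With a law like this, the learned policy only has to output desired poses x_d; contact forces are absorbed through the stiffness and damping terms rather than tracked rigidly, which is what makes a single interface for regulating both motion and contact behavior plausible.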