Hoi! -- A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation

arXiv cs.RO / April 17, 2026


Key Points

  • The paper introduces the Hoi! dataset, designed for force-grounded, cross-view articulated manipulation that links visual inputs, performed actions, and measured interaction forces.
  • The dataset includes 3,048 sequences involving 381 articulated objects across 38 environments, providing broad coverage for interaction research.
  • Each object is manipulated in four embodiments: human hand, human hand with a wrist-mounted camera, a handheld UMI gripper, and a custom Hoi! gripper, so that robot and human perspectives can be compared.
  • Because the tool embodiment is equipped with end-effector force and tactile sensing, the dataset supports evaluating transfer between human and robotic viewpoints and investigating underused modalities such as interaction forces (a schematic record layout is sketched after this list).
  • Dataset access and further details are available on the project website: https://timengelbracht.github.io/Hoi-Dataset-Website/.
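
The paper does not describe a loading API, but the structure implied by the key points can be sketched as a record type. Below is a minimal Python sketch under stated assumptions: the class, field, and identifier names (`HoiSequence`, `object_id`, `cabinet_012`, and so on) are hypothetical; only the counts, the four embodiments, and the force/tactile modalities come from the paper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

import numpy as np


class Embodiment(Enum):
    """The four embodiments described in the paper."""
    HUMAN_HAND = "human_hand"
    HAND_WRIST_CAM = "hand_wrist_cam"  # human hand with wrist-mounted camera
    UMI_GRIPPER = "umi_gripper"        # handheld UMI gripper
    HOI_GRIPPER = "hoi_gripper"        # custom Hoi! gripper (force + tactile)


@dataclass
class HoiSequence:
    """One of the 3,048 sequences; field names are illustrative guesses,
    not the dataset's published schema."""
    object_id: str                          # one of 381 articulated objects
    environment_id: str                     # one of 38 environments
    embodiment: Embodiment
    rgb_frames: np.ndarray                  # (T, H, W, 3) video from this embodiment's view
    ee_forces: Optional[np.ndarray] = None  # (T, 3) end-effector forces (tool embodiment only)
    tactile: Optional[np.ndarray] = None    # tactile readings (tool embodiment only)

    def is_force_grounded(self) -> bool:
        """True when measured interaction forces accompany the video."""
        return self.ee_forces is not None


# Hypothetical usage: a force-grounded gripper sequence that could be paired
# with a human-hand sequence on the same object for cross-view comparison.
seq = HoiSequence(
    object_id="cabinet_012",       # made-up identifier
    environment_id="kitchen_03",   # made-up identifier
    embodiment=Embodiment.HOI_GRIPPER,
    rgb_frames=np.zeros((120, 480, 640, 3), dtype=np.uint8),
    ee_forces=np.zeros((120, 3), dtype=np.float32),
)
assert seq.is_force_grounded()
```

The optional force and tactile fields mirror the dataset's design: only the instrumented tool embodiment records those streams, while all four embodiments record video.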

Abstract

We present a dataset for force-grounded, cross-view articulated manipulation that couples what is seen with what is done and what is felt during real human interaction. The dataset contains 3,048 sequences across 381 articulated objects in 38 environments. Each object is operated in four embodiments: (i) human hand, (ii) human hand with a wrist-mounted camera, (iii) handheld UMI gripper, and (iv) a custom Hoi! gripper, where the tool embodiment provides end-effector forces and tactile sensing. Our dataset offers a holistic view of interaction understanding from video, enabling researchers not only to evaluate how well methods transfer between human and robotic viewpoints, but also to investigate underexplored modalities such as interaction forces. The project website can be found at https://timengelbracht.github.io/Hoi-Dataset-Website/.