HiSync: Spatio-Temporally Aligning Hand Motion from Wearable IMU and On-Robot Camera for Command Source Identification in Long-Range HRI

arXiv cs.RO / 3/26/2026


Key Points

  • The paper proposes HiSync, an optical-inertial fusion framework that aligns a robot-mounted camera’s optical flow with a hand-worn IMU to identify the command source in long-range, multi-user HRI.
  • HiSync learns frequency-domain features from both modalities, denoises IMU signals with CSINet, and uses temporal alignment plus distance-aware multi-window fusion to match subtle natural gestures.
  • The authors collect a user-defined gesture set (N=12) and a multimodal command gesture dataset (N=38) for long-range multi-user scenarios, targeting the ambiguity created by distance and multiple users.
  • In three-person scenes up to 34 meters, HiSync reports 92.32% CSI accuracy and claims a 48.44% improvement over prior state of the art, and it is validated through real-robot deployment.
  • The work is positioned as a practical HRI primitive and provides design guidance, with code released on GitHub for reproducibility and further development.

Abstract

Long-range Human-Robot Interaction (HRI) remains underexplored. Within it, Command Source Identification (CSI), the task of determining who issued a command, is especially challenging due to the sensor ambiguity introduced by multiple users and distance. We introduce HiSync, an optical-inertial fusion framework that treats hand motion as a binding cue by aligning optical flow from a robot-mounted camera with signals from a hand-worn IMU. We first elicit a user-defined gesture set (N=12) and collect a multimodal command gesture dataset (N=38) in long-range multi-user HRI scenarios. Next, HiSync extracts frequency-domain hand motion features from both camera and IMU data; a learned CSINet then denoises the IMU readings, temporally aligns the modalities, and performs distance-aware multi-window fusion to compute cross-modal similarity of subtle, natural gestures, enabling robust CSI. In three-person scenes up to 34 m, HiSync achieves 92.32% CSI accuracy, outperforming the prior SOTA by 48.44%. HiSync is also validated in a real-robot deployment. By making CSI reliable and natural, HiSync provides a practical primitive and design guidance for public-space HRI. Code: https://github.com/OctopusWen/HiSync
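
To make the matching idea concrete, here is a minimal sketch of cross-modal command source identification. It is not the paper's CSINet: the learned denoising and the distance-aware multi-window fusion are replaced by a hand-crafted score, and every function name (`estimate_lag`, `identify_source`, and so on) is hypothetical. It assumes the IMU signal and each candidate's optical-flow signal have already been reduced to 1-D motion-magnitude series sampled at a common rate, and the equal weighting of the two similarity terms is arbitrary.

```python
import numpy as np


def zscore(x):
    """Standardise a 1-D signal; the epsilon guards against constant input."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std() + 1e-8)


def overlap(a, b, lag):
    """Return the overlapping parts of a and b when b is delayed by `lag` samples."""
    n = min(len(a), len(b))
    a, b = np.asarray(a[:n], dtype=float), np.asarray(b[:n], dtype=float)
    if lag >= 0:
        return a[: n - lag], b[lag:]
    return a[-lag:], b[: n + lag]


def pearson(x, y):
    """Pearson-style correlation of two equal-length signals."""
    return float(np.mean(zscore(x) * zscore(y)))


def estimate_lag(imu, flow, max_lag):
    """Search lags in [-max_lag, max_lag] for the best time-domain correlation."""
    lags = list(range(-max_lag, max_lag + 1))
    scores = [pearson(*overlap(imu, flow, lag)) for lag in lags]
    best = int(np.argmax(scores))
    return lags[best], scores[best]


def spectrum(x):
    """Hann-windowed magnitude spectrum with the DC bin dropped."""
    x = zscore(x)
    return np.abs(np.fft.rfft(x * np.hanning(len(x))))[1:]


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))


def identify_source(imu, flows, max_lag=10):
    """Score each candidate's flow signal against the IMU signal.

    Returns the index of the best-matching candidate and all scores.
    The 0.5/0.5 weighting is an illustrative assumption, not from the paper.
    """
    scores = []
    for flow in flows:
        lag, corr = estimate_lag(imu, flow, max_lag)   # temporal alignment
        x, y = overlap(imu, flow, lag)
        spec = cosine(spectrum(x), spectrum(y))        # frequency-domain match
        scores.append(0.5 * corr + 0.5 * spec)
    return int(np.argmax(scores)), scores
```

Calling `identify_source(imu, [flow_a, flow_b, flow_c])` returns the index of the person whose visible hand motion best matches the wearer's IMU trace, along with the per-person scores. A learned model such as the one the abstract describes would replace both the fixed features and the fixed weighting.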