Detecting Precise Hand Touch Moments in Egocentric Video

arXiv cs.CV / April 15, 2026

Key Points

  • The paper tackles frame-level detection of the exact hand-object touch onset in egocentric (first-person) video, which is important for AR, HCI, assistive technology, and robot learning, where contact onset signals action timing.
  • It introduces a Hand-informed Context Enhanced (HiCE) module that combines spatiotemporal hand-region features with surrounding context using cross-attention to better handle subtle motions and occlusions near contact.
  • The method is refined with a grasp-aware loss and soft labels that emphasize the hand-pose and motion dynamics characteristic of true touch, helping the model separate actual contact from near-contact frames.
  • It presents TouchMoment, an egocentric dataset with 4,021 videos and 8,456 annotated touch moments over more than one million frames.
  • On TouchMoment, using a strict two-frame tolerance evaluation, HiCE improves event-spotting performance and outperforms prior state-of-the-art baselines by 16.91% average precision.
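The strict evaluation mentioned above can be made concrete with a small sketch: a predicted touch frame counts as a true positive only if it falls within a two-frame tolerance of a still-unmatched ground-truth moment. The function below is illustrative only (names and the greedy matching strategy are assumptions, not the paper's code).

```python
def spot_precision(predicted, ground_truth, tolerance=2):
    """Fraction of predicted touch frames matched one-to-one to a
    ground-truth moment within `tolerance` frames (greedy matching).
    Hypothetical sketch of a tolerance-based event-spotting metric."""
    unmatched = list(ground_truth)
    true_positives = 0
    for p in sorted(predicted):
        # Find the closest still-unmatched ground-truth moment.
        best = min(unmatched, key=lambda g: abs(g - p), default=None)
        if best is not None and abs(best - p) <= tolerance:
            true_positives += 1
            unmatched.remove(best)
    return true_positives / len(predicted) if predicted else 0.0

# Example: two of three predictions fall inside the 2-frame window.
print(spot_precision([10, 31, 60], [9, 33, 90]))  # 2/3
```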

Abstract

We address the challenging task of detecting the precise moment when hands make contact with objects in egocentric videos. This frame-level detection is crucial for augmented reality, human-computer interaction, assistive technologies, and robot learning applications, where contact onset signals action initiation or completion. Temporally precise detection is particularly challenging due to subtle hand motion variations near contact, frequent occlusions, fine-grained manipulation patterns, and the inherent motion dynamics of first-person perspectives. To tackle these challenges, we propose a Hand-informed Context Enhanced module (HiCE; pronounced 'high-see') that leverages spatiotemporal features from hand regions and their surrounding context through cross-attention mechanisms, learning to identify potential contact patterns. Our approach is further refined with a grasp-aware loss and soft label that emphasizes hand pose patterns and movement dynamics characteristic of touch events, enabling the model to distinguish between near-contact and actual contact frames. We also introduce TouchMoment, an egocentric dataset containing 4,021 videos and 8,456 annotated contact moments spanning over one million frames. Experiments on TouchMoment show that, under a strict evaluation criterion that counts a prediction as correct only if it falls within a two-frame tolerance of the ground-truth moment, our method achieves substantial gains and outperforms state-of-the-art event-spotting baselines by 16.91% average precision.
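The cross-attention fusion the abstract describes can be sketched minimally: hand-region tokens act as queries that attend over surrounding-context tokens, yielding context-enhanced hand features. Everything below (token counts, feature width, the random projections standing in for learned weights) is an illustrative assumption, not the paper's implementation.

```python
import numpy as np

def cross_attention(hand_feats, context_feats, seed=0):
    """Hand tokens (queries) attend over context tokens (keys/values).
    hand_feats: (T_h, D); context_feats: (T_c, D). Returns (T_h, D).
    Minimal single-head sketch with random (untrained) projections."""
    rng = np.random.default_rng(seed)
    d = hand_feats.shape[-1]
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Q, K, V = hand_feats @ Wq, context_feats @ Wk, context_feats @ Wv
    scores = Q @ K.T / np.sqrt(d)                       # (T_h, T_c)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)            # softmax over context
    return attn @ V                                     # context-enhanced hand tokens

hand = np.random.default_rng(1).standard_normal((4, 16))  # 4 hand-region tokens
ctx = np.random.default_rng(2).standard_normal((9, 16))   # 9 context tokens
out = cross_attention(hand, ctx)
print(out.shape)  # (4, 16)
```

In the paper's setting these tokens would be spatiotemporal features from a video backbone, and the projections would be learned end-to-end with the grasp-aware loss.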