Detecting Precise Hand Touch Moments in Egocentric Video
arXiv cs.CV / 4/15/2026
Key Points
- The paper tackles frame-level detection of the exact hand-object touch onset in egocentric (first-person) video, a capability relevant to AR, HCI, assistive tech, and robot learning, where the precise moment of contact cues action timing.
- It introduces a Hand-informed Context Enhanced (HiCE) module that fuses spatiotemporal hand-region features with surrounding scene context via cross-attention (sketched after this list), better handling the subtle motions and occlusions that occur near contact.
- Training is refined with a grasp-aware loss and soft labels (see the soft-label sketch below) that emphasize the hand pose and motion dynamics distinguishing true touch frames from near-contact frames.
- It presents TouchMoment, an egocentric dataset with 4,021 videos and 8,456 annotated touch moments over more than one million frames.
- On TouchMoment, under a strict two-frame tolerance evaluation (a matching sketch follows this list), HiCE outperforms prior state-of-the-art event-spotting baselines by 16.91% average precision.
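
The cross-attention fusion in the HiCE module is described above only at a high level. The following is a minimal sketch of how hand-region features could query surrounding-context tokens; the class name `HandContextCrossAttention`, the feature shapes, and the residual fusion are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of hand-to-context cross-attention (not the paper's code).
# Shapes, dimensions, and names below are illustrative assumptions.
import torch
import torch.nn as nn

class HandContextCrossAttention(nn.Module):
    """Hand-region features query the surrounding-context features."""
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, hand_feats: torch.Tensor, ctx_feats: torch.Tensor) -> torch.Tensor:
        # hand_feats: (B, T, D) spatiotemporal features pooled from hand crops
        # ctx_feats:  (B, N, D) tokens from the full frame / surrounding context
        enhanced, _ = self.attn(query=hand_feats, key=ctx_feats, value=ctx_feats)
        return self.norm(hand_feats + enhanced)  # residual fusion

hand = torch.randn(2, 16, 256)  # 16 frames of hand-region features
ctx = torch.randn(2, 196, 256)  # 14x14 grid of context tokens
out = HandContextCrossAttention()(hand, ctx)
print(out.shape)  # torch.Size([2, 16, 256])
```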
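
The grasp-aware loss and soft labels are likewise summarized without formulas. One common way to soften frame-level onset labels is a Gaussian bump around each annotated touch frame; the sketch below assumes that form with a plain binary cross-entropy loss, which may differ from the paper's grasp-aware weighting.

```python
# Sketch of soft temporal labels around annotated touch frames (illustrative;
# the paper's exact label shape and loss weighting are not specified here).
import torch
import torch.nn.functional as F

def soft_touch_labels(num_frames: int, touch_frames: list[int], sigma: float = 1.0) -> torch.Tensor:
    """Gaussian bumps centered on each annotated touch onset frame."""
    t = torch.arange(num_frames, dtype=torch.float32)
    labels = torch.zeros(num_frames)
    for f in touch_frames:
        labels = torch.maximum(labels, torch.exp(-0.5 * ((t - f) / sigma) ** 2))
    return labels

logits = torch.randn(100)                   # per-frame touch scores from a model
targets = soft_touch_labels(100, [23, 71])  # soft labels instead of hard 0/1
loss = F.binary_cross_entropy_with_logits(logits, targets)
```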
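
Finally, the two-frame tolerance metric implies a matching step between predicted and ground-truth onsets. A greedy nearest-match within ±2 frames is one plausible protocol; the matcher below is an assumption about that protocol, not the paper's evaluation code.

```python
# Sketch of tolerance-based event-spotting matching: a prediction counts as
# correct if it lands within +/- tol frames of an unmatched ground-truth onset.
def match_events(pred_frames: list[int], gt_frames: list[int], tol: int = 2):
    matched_gt: set[int] = set()
    tp = 0
    for p in sorted(pred_frames):
        # find the nearest still-unmatched ground-truth onset within tolerance
        candidates = [g for g in gt_frames if g not in matched_gt and abs(p - g) <= tol]
        if candidates:
            matched_gt.add(min(candidates, key=lambda g: abs(p - g)))
            tp += 1
    precision = tp / len(pred_frames) if pred_frames else 0.0
    recall = tp / len(gt_frames) if gt_frames else 0.0
    return precision, recall

print(match_events([10, 25, 40], [11, 27]))  # (0.666..., 1.0)
```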