FEEL (Force-Enhanced Egocentric Learning): A Dataset for Physical Action Understanding

arXiv cs.CV / 3/18/2026

Key Points

  • FEEL (Force-Enhanced Egocentric Learning) is the first large-scale dataset pairing force measurements from custom piezoresistive gloves with egocentric video to enable force-informed physical action understanding.
  • It contains approximately 3 million force-synchronized frames of natural, unscripted kitchen manipulation, with 45% of frames involving hand-object contact; a sketch of how such force-to-frame synchronization might work follows this list.
  • FEEL supports two task families: (1) contact understanding via temporal contact segmentation and pixel-level segmentation of contacted objects, and (2) action representation learning with force prediction as a self-supervised pretraining objective for video backbones.
  • The work reports state-of-the-art results on temporal contact segmentation, competitive pixel-level segmentation, and transfer gains, obtained without manual labels, on action understanding tasks across EPIC-Kitchens, Something-Something V2, Ego-Exo4D, and MECCANO.
  • By treating force as a core primitive of physical interaction, FEEL enables scalable data collection and improved generalization for action understanding models.
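
The synchronization mentioned above can be thought of as nearest-timestamp alignment between the glove's force stream and the video's frame stream. The sketch below only illustrates that idea under assumed field names and rates (30 fps video, 200 Hz gloves, 16 sensors per glove); it is not the FEEL release's actual tooling.

```python
# Hypothetical sketch: aligning force-glove samples to egocentric video frames
# by nearest timestamp. Field names, sampling rates, and sensor counts are
# assumptions for illustration, not taken from the FEEL release.
import numpy as np

def align_force_to_frames(frame_times_s, force_times_s, force_values):
    """For each video frame timestamp, pick the force sample closest in time.

    frame_times_s: (N,) frame timestamps in seconds
    force_times_s: (M,) force-sample timestamps in seconds (sorted ascending)
    force_values:  (M, S) per-sensor force readings
    Returns an (N, S) array of force readings synchronized to the frames.
    """
    # Index of the first force sample at or after each frame time
    idx = np.searchsorted(force_times_s, frame_times_s)
    idx = np.clip(idx, 1, len(force_times_s) - 1)
    # Choose whichever neighboring sample (before/after) is closer in time
    left_closer = (frame_times_s - force_times_s[idx - 1]) < (force_times_s[idx] - frame_times_s)
    idx = np.where(left_closer, idx - 1, idx)
    return force_values[idx]

# Toy usage: 30 fps video, 200 Hz glove sampling, 16 sensors (all assumed)
frames = np.arange(0, 10, 1 / 30)
samples = np.arange(0, 10, 1 / 200)
forces = np.random.rand(len(samples), 16)
synced = align_force_to_frames(frames, samples, forces)
print(synced.shape)  # (300, 16): one force vector per video frame
```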

Abstract

We introduce FEEL (Force-Enhanced Egocentric Learning), the first large-scale dataset pairing force measurements gathered from custom piezoresistive gloves with egocentric video. Our gloves enable scalable data collection, and FEEL contains approximately 3 million force-synchronized frames of natural, unscripted manipulation in kitchen environments, with 45% of frames involving hand-object contact. Because force is the underlying cause that drives physical interaction, it is a critical primitive for physical action understanding. We demonstrate the utility of force for physical action understanding by applying FEEL to two families of tasks: (1) contact understanding, where we jointly perform temporal contact segmentation and pixel-level contacted-object segmentation; and (2) action representation learning, where force prediction serves as a self-supervised pretraining objective for video backbones. We achieve state-of-the-art temporal contact segmentation results and competitive pixel-level segmentation results without any manual contacted-object segmentation annotations. Furthermore, we demonstrate that action representation learning with FEEL improves transfer performance, without any manual labels, on action understanding tasks across EPIC-Kitchens, Something-Something V2, Ego-Exo4D, and MECCANO.
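
To make the second task family concrete, here is a minimal sketch of what force prediction as a self-supervised pretraining objective could look like: a video backbone produces per-frame features, a small head regresses the synchronized glove forces, and a simple MSE loss supervises the backbone. The module names, dimensions, loss choice, and the assumption that the backbone returns per-frame features are illustrative, not the paper's implementation.

```python
# Illustrative sketch of force prediction as a self-supervised pretraining
# objective. Module names, feature dimensions, sensor count, and the MSE loss
# are assumptions, not FEEL's actual implementation.
import torch
import torch.nn as nn

class ForcePredictionHead(nn.Module):
    """Regresses per-frame glove forces from backbone clip features."""
    def __init__(self, feat_dim=768, num_sensors=16):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_sensors),
        )

    def forward(self, frame_feats):       # (B, T, feat_dim)
        return self.proj(frame_feats)     # (B, T, num_sensors)

def pretrain_step(backbone, head, optimizer, clip, force_targets):
    """One pretraining step: predict synchronized glove forces from video.

    clip:          (B, T, C, H, W) video frames
    force_targets: (B, T, num_sensors) force measurements aligned to the frames
    """
    frame_feats = backbone(clip)          # assumed to return (B, T, feat_dim)
    pred = head(frame_feats)
    loss = nn.functional.mse_loss(pred, force_targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Only the synchronized video and glove readings are needed for this step, which is what lets the objective scale without manual labels; the pretrained backbone can then be fine-tuned on downstream action understanding benchmarks.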