M2R2: MultiModal Robotic Representation for Temporal Action Segmentation

arXiv cs.RO / 4/30/2026


Key Points

  • The paper introduces M2R2, a multimodal robotic feature extractor for temporal action segmentation (TAS) that jointly uses proprioceptive (robot state) and exteroceptive (vision/sensor) information; a minimal sketch of this two-stream idea follows this list.
  • It proposes a new training strategy designed to make learned features reusable across multiple TAS models, addressing a limitation of prior multimodal approaches that entangle feature fusion inside each model.
  • The authors report new state-of-the-art results on three robotic datasets—REASSEMBLE, (Im)PerfectPour, and JIGSAWS.
  • An extensive ablation study is included to quantify how different sensor modalities contribute to performance in robotic TAS tasks.
  • The work targets a key mismatch between robotics and vision pipelines: vision-only pretrained extractors can degrade when object visibility is limited, which M2R2 aims to mitigate.
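To make the two-stream idea concrete, here is a minimal PyTorch sketch of a feature extractor that encodes a proprioceptive stream (e.g., joint states) and an exteroceptive stream (e.g., per-frame visual features) and fuses them into one per-frame representation. This is an illustrative assumption, not the M2R2 architecture: the class name, layer sizes, and concatenation-based fusion are all hypothetical.

```python
# Hypothetical two-stream multimodal feature extractor (illustrative only;
# not the authors' architecture). Proprioceptive input: per-frame joint
# positions/velocities. Exteroceptive input: per-frame visual features,
# e.g., from a pretrained vision backbone.
import torch
import torch.nn as nn

class MultimodalExtractor(nn.Module):
    def __init__(self, proprio_dim=14, visual_dim=512, feat_dim=256):
        super().__init__()
        self.proprio_enc = nn.Sequential(
            nn.Linear(proprio_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )
        self.visual_enc = nn.Sequential(
            nn.Linear(visual_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )
        # Fuse the two streams into a single per-frame feature vector.
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, proprio, visual):
        # proprio: (batch, time, proprio_dim); visual: (batch, time, visual_dim)
        z = torch.cat([self.proprio_enc(proprio), self.visual_enc(visual)], dim=-1)
        return self.fuse(z)  # (batch, time, feat_dim)

# Usage on dummy data: 2 sequences of 100 frames each.
extractor = MultimodalExtractor()
feats = extractor(torch.randn(2, 100, 14), torch.randn(2, 100, 512))
print(feats.shape)  # torch.Size([2, 100, 256])
```

The point of the fusion happening inside a standalone extractor, rather than inside a specific TAS model, is exactly the reusability the key points describe.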

Abstract

Temporal action segmentation (TAS) has long been a key area of research in both robotics and computer vision. In robotics, algorithms have primarily focused on leveraging proprioceptive information to determine skill boundaries, with recent approaches in surgical robotics incorporating vision. In contrast, computer vision typically relies on exteroceptive sensors, such as cameras. Existing multimodal TAS models in robotics integrate feature fusion within the model, making it difficult to reuse learned features across different models. Meanwhile, pretrained vision-only feature extractors commonly used in computer vision struggle in scenarios with limited object visibility. In this work, we address these challenges by proposing M2R2, a multimodal feature extractor tailored for TAS, which combines information from both proprioceptive and exteroceptive sensors. We introduce a novel training strategy that enables the reuse of learned features across multiple TAS models. Our method sets new state-of-the-art performance on three robotic datasets: REASSEMBLE, (Im)PerfectPour, and JIGSAWS. Additionally, we conduct an extensive ablation study to evaluate the contribution of different modalities in robotic TAS tasks.
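One way to read the reuse claim in the abstract: once the extractor is trained, it can be frozen, its per-frame features computed once and cached, and those same features fed to any number of downstream TAS models. The sketch below illustrates that pattern under assumed names and shapes; it is not the paper's training strategy or API, and the toy `FrameClassifier` merely stands in for a real TAS model such as MS-TCN.

```python
# Hypothetical sketch of reusable features: a frozen extractor produces
# per-frame features once, and multiple downstream TAS models train on the
# same cached tensor. All names and dimensions here are illustrative.
import torch
import torch.nn as nn

# Stand-in for a trained M2R2-style extractor; frozen at segmentation time.
extractor = nn.Linear(526, 256)  # e.g., proprio (14) + visual (512) -> 256
for p in extractor.parameters():
    p.requires_grad = False

class FrameClassifier(nn.Module):
    """Toy stand-in for a TAS model (a real one would be, e.g., MS-TCN)."""
    def __init__(self, feat_dim=256, num_actions=10):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_actions)

    def forward(self, feats):        # feats: (batch, time, feat_dim)
        return self.head(feats)      # per-frame action logits

with torch.no_grad():
    feats = extractor(torch.randn(2, 100, 526))  # computed once, cached

# The same cached features drive two different segmentation models.
logits_a = FrameClassifier()(feats)
logits_b = FrameClassifier(num_actions=7)(feats)
print(logits_a.shape, logits_b.shape)  # (2, 100, 10) and (2, 100, 7)
```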