Continual Multimodal Egocentric Activity Recognition via Modality-Aware Novel Detection

arXiv cs.CV / 3/19/2026

Key Points

  • The authors propose MAND, a modality-aware framework for multimodal egocentric open-world continual learning to detect novel activities while learning from non-stationary streams.
  • It introduces Modality-aware Adaptive Scoring (MoAS), which estimates sample-wise modality reliability from energy scores and adaptively fuses the per-modality logits, letting otherwise underused cues, especially IMU, contribute to novelty detection.
  • During training, Modality-wise Representation Stabilization Training (MoRST) preserves modality-specific discriminability across tasks via auxiliary heads and modality-wise logit distillation.
  • The design counters the tendency of fused logits to be dominated by RGB, which leaves IMU cues underutilized, an imbalance that worsens under catastrophic forgetting in open-world settings.
  • Experiments on a public multimodal egocentric benchmark show up to 10% improvement in novel activity detection AUC and up to 2.8% improvement in known-class accuracy over state-of-the-art baselines.
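The MoAS idea from the key points above, scoring each modality by its energy and weighting the logit fusion accordingly, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`energy_score`, `fuse_logits`) and the softmax-over-negative-energy weighting are assumptions for clarity.

```python
import numpy as np

def energy_score(logits, T=1.0):
    """Free-energy score of a logit vector: lower energy suggests a more
    confident (in-distribution) prediction for that modality."""
    z = logits / T
    m = np.max(z)                       # stabilized log-sum-exp
    return -T * (m + np.log(np.sum(np.exp(z - m))))

def fuse_logits(modality_logits, T=1.0):
    """Adaptively fuse per-modality logits by sample-wise reliability.

    modality_logits: dict mapping modality name -> (num_classes,) array.
    Returns the fused logit vector and the per-modality weights.
    """
    names = list(modality_logits)
    energies = np.array([energy_score(modality_logits[m], T) for m in names])
    # Lower energy -> higher reliability: softmax over negative energies.
    neg = -energies - np.max(-energies)
    weights = np.exp(neg) / np.sum(np.exp(neg))
    fused = sum(w * modality_logits[m] for w, m in zip(weights, names))
    return fused, dict(zip(names, weights))
```

With a confident RGB head and a near-flat IMU head, the RGB modality receives the larger weight; when RGB is uncertain on a given sample, the weighting shifts toward IMU, which is the imbalance MoAS is designed to correct.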

Abstract

Multimodal egocentric activity recognition integrates visual and inertial cues for robust first-person behavior understanding. However, deploying such systems in open-world environments requires detecting novel activities while continuously learning from non-stationary streams. Existing methods rely on the main logits for novelty scoring, without fully exploiting the complementary evidence available from individual modalities. Because these logits are often dominated by RGB, cues from other modalities, particularly IMU, remain underutilized, and this imbalance worsens over time under catastrophic forgetting. To address this, we propose MAND, a modality-aware framework for multimodal egocentric open-world continual learning. At inference, Modality-aware Adaptive Scoring (MoAS) estimates sample-wise modality reliability from energy scores and adaptively integrates modality logits to better exploit complementary modality cues for novelty detection. During training, Modality-wise Representation Stabilization Training (MoRST) preserves modality-specific discriminability across tasks via auxiliary heads and modality-wise logit distillation. Experiments on a public multimodal egocentric benchmark show that MAND improves novel activity detection AUC by up to 10% and known-class classification accuracy by up to 2.8% over state-of-the-art baselines.
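The "modality-wise logit distillation" mentioned for MoRST can be pictured as a standard temperature-scaled KL distillation applied separately to each modality's auxiliary head, so that each modality's decision boundary is anchored to the previous task's model. This is a hedged sketch of that generic mechanism, not the paper's exact loss; the function name and the equal averaging over modalities are assumptions.

```python
import numpy as np

def _softmax(z, T):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def modality_distillation_loss(curr_logits, prev_logits, T=2.0):
    """Average temperature-scaled KL(prev || curr) over modalities.

    curr_logits / prev_logits: dicts mapping modality name -> logit array
    from the current model and the frozen previous-task model.
    """
    losses = []
    for m in curr_logits:
        p = _softmax(prev_logits[m], T)     # teacher (previous task)
        q = _softmax(curr_logits[m], T)     # student (current model)
        kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)))
        losses.append(kl * T * T)           # usual T^2 scaling for KD
    return float(np.mean(losses))
```

Distilling per modality, rather than on the fused logits alone, is what would keep a weaker modality such as IMU discriminative across tasks even while the fused output stays RGB-dominated.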