Watching Movies Like a Human: Egocentric Emotion Understanding for Embodied Companions

arXiv cs.CV / 4/20/2026

📰 News · Models & Research

Key Points

  • The paper targets the domain shift faced by embodied robotic agents, which infer movie emotions from egocentric screen-view recordings rather than native cinematic footage.
  • It introduces EgoScreen-Emotion (ESE), a new benchmark of 224 egocentric screen-view movie trailers with 28,667 temporally aligned key frames, each annotated by multiple raters under a confidence-aware multi-label scheme.
  • The study also proposes a multimodal long-context emotion reasoning framework that integrates temporal visual evidence, narrative summaries, compressed historical context, and audio cues.
  • Experiments reveal a large performance drop when models trained on cinematic footage are evaluated on realistic egocentric observations (Macro-F1 falls from 27.99 to 16.69; see the metric sketch after this list), while training on ESE substantially improves robustness.
  • Results indicate competitive performance relative to strong closed-source multimodal models, emphasizing the need for domain-specific data and long-context multimodal reasoning for embodied companion scenarios.
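
For readers unfamiliar with the metric behind those numbers, the sketch below shows how multi-label Macro-F1 is computed: per-class F1 scores averaged with equal weight. The emotion label set and the toy arrays are hypothetical placeholders for illustration, not data from the paper.

```python
"""Minimal sketch of the Macro-F1 metric used in the cross-domain comparison.
Labels and arrays are toy placeholders, not the ESE data."""
import numpy as np
from sklearn.metrics import f1_score

EMOTIONS = ["joy", "sadness", "fear", "anger", "surprise"]  # hypothetical label set

def macro_f1(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # average="macro" gives every emotion class equal weight, so rare classes
    # affect the score as much as frequent ones.
    return f1_score(y_true, y_pred, average="macro", zero_division=0)

# Toy multi-hot ground truth for 4 key frames over the 5 classes above.
y_true = np.array([
    [1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 0, 0, 1, 0],
])

# Hypothetical predictions from a model trained on cinematic footage but
# evaluated on domain-shifted egocentric screen-view frames.
y_pred = np.array([
    [1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
])

print(f"Macro-F1: {macro_f1(y_true, y_pred):.4f}")
```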

Abstract

Embodied robotic agents often perceive movies through an egocentric screen-view interface rather than native cinematic footage, introducing domain shifts such as viewpoint distortion, scale variation, illumination changes, and environmental interference. However, existing research on movie emotion understanding is almost exclusively conducted on cinematic footage, limiting cross-domain generalization to real-world viewing scenarios. To bridge this gap, we introduce EgoScreen-Emotion (ESE), the first benchmark dataset for egocentric screen-view movie emotion understanding. ESE contains 224 movie trailers captured under controlled egocentric screen-view conditions, producing 28,667 temporally aligned key frames annotated by multiple raters with a confidence-aware multi-label protocol to address emotional ambiguity. We further build a multimodal long-context emotion reasoning framework that models temporal visual evidence, narrative summaries, compressed historical context, and audio cues. Cross-domain experiments reveal a severe domain gap: models trained on cinematic footage drop from 27.99 to 16.69 in Macro-F1 when evaluated on realistic egocentric screen-view observations. Training on ESE substantially improves robustness under realistic viewing conditions. Our approach achieves competitive performance compared with strong closed-source multimodal models, highlighting the importance of domain-specific data and long-context multimodal reasoning.
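
The abstract does not spell out how the confidence-aware multi-label annotations are aggregated. One plausible reading is sketched below; the rater annotations, confidence-weighted averaging rule, and threshold are assumptions for illustration only, not the paper's published protocol.

```python
"""Illustrative sketch of one way confidence-aware multi-rater labels could be
aggregated into per-frame soft and multi-hot labels. All values are assumptions."""
from collections import defaultdict

# Hypothetical raw annotations: each rater tags one key frame with
# (emotion label, confidence in [0, 1]) pairs.
frame_annotations = [
    [("fear", 0.9), ("surprise", 0.6)],   # rater 1
    [("fear", 0.7)],                      # rater 2
    [("sadness", 0.4), ("fear", 0.8)],    # rater 3
]

def aggregate(annotations, threshold=0.3):
    """Confidence-weighted vote: average each label's confidence over all raters
    (labels a rater omits count as 0) and keep labels above the threshold."""
    n_raters = len(annotations)
    totals = defaultdict(float)
    for rater in annotations:
        for label, conf in rater:
            totals[label] += conf
    soft = {label: total / n_raters for label, total in totals.items()}
    multi_hot = {label for label, score in soft.items() if score >= threshold}
    return soft, multi_hot

soft_labels, kept_labels = aggregate(frame_annotations)
print(soft_labels)   # {'fear': 0.8, 'surprise': 0.2, 'sadness': ~0.13}
print(kept_labels)   # {'fear'} -- low-confidence, ambiguous emotions are filtered out
```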