Watching Movies Like a Human: Egocentric Emotion Understanding for Embodied Companions
arXiv cs.CV / 4/20/2026
📰 News · Models & Research
Key Points
- The paper addresses a domain-shift problem faced by embodied robotic agents: emotions must be inferred from egocentric screen-view recordings of movies rather than from native cinematic footage.
- It introduces EgoScreen-Emotion (ESE), a new benchmark of 224 egocentric screen-view movie trailers with 28,667 temporally aligned key frames, each annotated by multiple raters under a confidence-aware multi-label scheme (a hypothetical aggregation sketch follows this list).
- The study also proposes a multimodal long-context emotion reasoning framework that integrates temporal visual evidence, narrative summaries, compressed historical context, and audio cues.
- Experiments show a large performance drop when models trained on cinematic footage are evaluated on realistic egocentric observations (Macro-F1 drops from 27.99 to 16.69; a metric sketch follows the list), while training on ESE significantly improves robustness.
- Results indicate competitive performance relative to strong closed-source multimodal models, emphasizing the need for domain-specific data and long-context multimodal reasoning for embodied companion scenarios.
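The summary does not spell out how the confidence-aware multi-label annotations are combined, so the sketch below is one hypothetical reading: each rater tags a key frame with emotion labels plus confidences, and a per-frame soft target is formed by averaging across raters. The emotion vocabulary, record layout, and averaging rule are all assumptions for illustration, not ESE's documented protocol.

```python
from collections import defaultdict

# Hypothetical emotion vocabulary; the actual ESE label set is not
# given in the summary above.
EMOTIONS = ["joy", "sadness", "fear", "anger", "surprise", "neutral"]

def aggregate_frame_labels(ratings, emotions=EMOTIONS):
    """Aggregate per-rater (label, confidence) pairs for one key frame
    into a soft multi-label target.

    ratings: list of dicts, one per rater, mapping emotion -> confidence
             in [0, 1]; emotions a rater did not select count as 0.
    Returns: dict mapping each emotion to its mean confidence across raters.
    """
    totals = defaultdict(float)
    for rater in ratings:
        for emotion, conf in rater.items():
            totals[emotion] += conf
    n = max(len(ratings), 1)
    return {e: totals[e] / n for e in emotions}

# Three raters annotate the same key frame with different confidences.
frame_ratings = [
    {"joy": 0.9, "surprise": 0.4},
    {"joy": 0.7},
    {"joy": 0.8, "surprise": 0.6, "fear": 0.2},
]
print(aggregate_frame_labels(frame_ratings))
# joy -> 0.8, surprise -> ~0.33, fear -> ~0.07, all others -> 0.0
```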
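Macro-F1, the metric behind the reported 27.99-to-16.69 drop, computes an F1 score per emotion class and averages them with equal weight, so rare emotions count as much as frequent ones. A minimal sketch on toy multi-label data using scikit-learn; the toy matrices and four-class setup are illustrative, not the paper's evaluation code:

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multi-label ground truth and predictions over 4 emotion classes;
# rows are samples, columns are binary class indicators.
y_true = np.array([[1, 0, 0, 1],
                   [0, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 0, 1, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [0, 0, 1, 0],
                   [0, 1, 1, 1]])

# average="macro" scores each class separately, then takes the
# unweighted mean across classes.
print(f1_score(y_true, y_pred, average="macro"))  # 0.75 on this toy data
```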