EmoTrans: A Benchmark for Understanding, Reasoning, and Predicting Emotion Transitions in Multimodal LLMs

arXiv cs.CV / 4/28/2026

📰 News · Signals & Early Trends · Models & Research

Key Points

  • The paper introduces EmoTrans, a new benchmark designed to evaluate how multimodal LLMs understand emotion as a dynamic process rather than static emotion recognition.
  • EmoTrans includes 1,000 manually annotated multimodal video clips across 12 real-world scenarios, along with 3,000+ task-specific QA pairs for fine-grained assessment.
  • It defines four progressively challenging tasks—Emotion Change Detection, Emotion State Identification, Emotion Transition Reasoning, and Next Emotion Prediction—to test detection, reasoning, and forecasting of emotion transitions.
  • Evaluations on 18 state-of-the-art MLLMs show stronger performance on coarse change detection but persistent difficulty in fine-grained emotion-dynamics modeling, with multi-person social contexts remaining particularly challenging.
  • The authors publicly release the benchmark, evaluation protocol, and code via a GitHub repository to support future research.
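The four tasks form a progressive evaluation over task-specific QA pairs, which suggests reporting accuracy per task. A minimal sketch of such a per-task scoring pass is below; the record format (task codes like `ECD`/`NEP`, `answer` and `prediction` fields) is a hypothetical assumption for illustration, not the benchmark's actual released schema.

```python
# Hypothetical sketch: per-task accuracy over EmoTrans-style QA records.
# The field names and task codes here are assumed for illustration only;
# the released benchmark defines its own data format and protocol.
from collections import defaultdict

def per_task_accuracy(qa_records):
    """Group QA records by task and return exact-match accuracy per task."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in qa_records:
        total[rec["task"]] += 1
        if rec["prediction"] == rec["answer"]:
            correct[rec["task"]] += 1
    return {task: correct[task] / total[task] for task in total}

# Toy example with assumed task codes and labels.
records = [
    {"task": "ECD", "answer": "change", "prediction": "change"},
    {"task": "ECD", "answer": "no_change", "prediction": "change"},
    {"task": "NEP", "answer": "joy", "prediction": "joy"},
]
print(per_task_accuracy(records))  # {'ECD': 0.5, 'NEP': 1.0}
```

Reporting scores per task (rather than one pooled number) is what makes the coarse-vs-fine-grained gap described in the findings visible.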

Abstract

Recent multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and generation, and are increasingly used in applications such as social robots and human-computer interaction, where understanding human emotions is essential. However, existing benchmarks mainly formulate emotion understanding as a static recognition problem, leaving it largely unclear whether current MLLMs can understand emotion as a dynamic process that evolves, shifts between states, and unfolds across diverse social contexts. To bridge this gap, we present EmoTrans, a benchmark for evaluating emotion dynamics understanding in multimodal videos. EmoTrans contains 1,000 carefully collected and manually annotated video clips, covering 12 real-world scenarios, and further provides over 3,000 task-specific question-answer (QA) pairs for fine-grained evaluation. The benchmark introduces four tasks, namely Emotion Change Detection (ECD), Emotion State Identification (ESI), Emotion Transition Reasoning (ETR), and Next Emotion Prediction (NEP), forming a progressive evaluation framework from coarse-grained detection to deeper reasoning and prediction. We conduct a comprehensive evaluation of 18 state-of-the-art MLLMs on EmoTrans and obtain two main findings. First, although current MLLMs show relatively stronger performance on coarse-grained emotion change detection, they still struggle with fine-grained emotion dynamics modeling. Second, socially complex settings, especially multi-person scenarios, remain substantially challenging, while reasoning-oriented variants do not consistently yield clear improvements. To facilitate future research, we publicly release the benchmark, evaluation protocol, and code at https://github.com/Emo-gml/EmoTrans.