EmoMM: Benchmarking and Steering MLLM for Multimodal Emotion Recognition under Conflict and Missingness

arXiv cs.CV / 5/5/2026


Key Points

  • The paper introduces EmoMM, a benchmark for Multimodal Emotion Recognition that explicitly includes modality-aligned, modality-conflict, and missing-modality subsets to study MLLM behavior in realistic conditions.
  • Extensive experiments reveal a “Video Contribution Collapse (VCC)” phenomenon, in which MLLMs often downplay video evidence because video tokens are highly redundant and the models’ preferences for other modalities skew decisions.
  • To mitigate this without retraining, the authors propose CHASE (Conflict-aware Head-level Attention Steering), an inference-time, lightweight attention steering method that detects modality conflicts and reduces decision bias.
  • Results show CHASE improves performance across multiple settings, making MLLM-based emotion recognition more reliable in complex affective scenarios involving conflicts and missing inputs.

Abstract

Multimodal Emotion Recognition (MER) is critical for interpreting real-world interactions. While Multimodal Large Language Models (MLLMs) have shown promise in MER, their internal decision-making mechanisms under modality conflict and missingness remain largely underexplored. In this paper, to systematically investigate these behaviors, we introduce EmoMM, a comprehensive benchmark featuring modality-aligned, conflict, and missing subsets. Through extensive evaluation, we uncover a Video Contribution Collapse (VCC) phenomenon, where MLLMs marginalize video evidence due to high token redundancy and modality preferences. To address this, we propose Conflict-aware Head-level Attention Steering (CHASE), a lightweight mechanism that detects modality conflicts and performs inference-time attention steering, effectively mitigating decision bias without retraining the backbone. Experimental results demonstrate that CHASE consistently improves performance across various settings, significantly enhancing the reliability of MLLMs in complex affective scenarios.
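To make the idea of inference-time, head-level attention steering concrete, here is a minimal NumPy sketch. It is an illustrative assumption, not the paper's actual CHASE algorithm: the conflict metric (total-variation distance between per-modality predictions), the choice of which heads to steer, and the boost factor `gamma` are all hypothetical placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def conflict_score(p_video, p_other):
    # Hypothetical conflict detector: total-variation distance between
    # the video-only and non-video emotion distributions (0 = agreement).
    return 0.5 * np.abs(p_video - p_other).sum()

def steer_heads(attn, video_mask, head_ids, gamma=0.5):
    """Upweight video-token keys for selected attention heads.

    attn:       (H, Q, K) attention weights, rows summing to 1
    video_mask: (K,) boolean mask marking video tokens among the keys
    head_ids:   indices of heads to steer (assumed chosen offline)
    gamma:      illustrative boost strength
    """
    out = attn.copy()
    boost = np.where(video_mask, 1.0 + gamma, 1.0)         # (K,) per-key boost
    out[head_ids] = out[head_ids] * boost                  # upweight video keys
    out[head_ids] /= out[head_ids].sum(-1, keepdims=True)  # renormalize rows
    return out

# Toy example: 2 heads, 1 query, 4 keys (first two are video tokens).
attn = np.full((2, 1, 4), 0.25)
video_mask = np.array([True, True, False, False])
p_video = softmax(np.array([2.0, 0.0, 0.0]))   # video-only prediction
p_other = softmax(np.array([0.0, 2.0, 0.0]))   # audio/text prediction
if conflict_score(p_video, p_other) > 0.3:     # steer only under conflict
    attn = steer_heads(attn, video_mask, head_ids=[0], gamma=1.0)
```

Only the selected heads are modified and only when a conflict is detected, which is what makes this kind of steering cheap enough to run at inference time without touching the backbone weights.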
