Multimodal Emotion Recognition via Bi-directional Cross-Attention and Temporal Modeling

arXiv cs.CV / 3/13/2026

Key Points

  • The paper proposes a multimodal emotion recognition framework for the ABAW EXPR task that uses CLIP for visual encoding and Wav2Vec 2.0 for audio, with a Temporal Convolutional Network to capture temporal dynamics.
  • It features a bi-directional cross-attention fusion module that enables symmetric interaction between visual and audio features to enhance cross-modal contextualization.
  • It introduces a text-guided contrastive objective based on CLIP text features to promote semantically aligned visual representations.
  • Experimental results on the 10th ABAW EXPR benchmark show that the proposed framework provides a strong multimodal baseline and improves over unimodal models, highlighting the benefit of combining temporal visual modeling, audio representation learning, and cross-modal fusion in real-world settings.
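
To make the fusion idea above concrete, here is a minimal, dependency-free sketch of bi-directional cross-attention: scaled dot-product attention applied in both directions (visual queries over audio, and audio queries over visual), with a residual sum as a simple fusion rule. This is an illustration only, not the paper's implementation; all function names, the residual fusion, and the assumption that both modalities are already projected to a shared feature dimension are ours:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over the keys/values
    of the *other* modality. Shapes: queries [T_q][d], keys/values [T_k][d]."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def bidirectional_fusion(visual, audio):
    """Symmetric cross-modal interaction (assumed fusion rule): visual attends
    to audio and audio attends to visual; each stream keeps a residual sum.
    Assumes both streams share the same feature dimension."""
    v2a = cross_attention(visual, audio, audio)    # visual queries, audio keys/values
    a2v = cross_attention(audio, visual, visual)   # audio queries, visual keys/values
    fused_v = [[x + y for x, y in zip(v, a)] for v, a in zip(visual, v2a)]
    fused_a = [[x + y for x, y in zip(a, v)] for a, v in zip(audio, a2v)]
    return fused_v, fused_a
```

Each output stream keeps its own temporal length (one fused vector per frame or audio step), which is why a lightweight classification head can be attached directly afterward.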

Abstract

Emotion recognition in in-the-wild video data remains a challenging problem due to large variations in facial appearance, head pose, illumination, background noise, and the inherently dynamic nature of human affect. Relying on a single modality, such as facial expressions or speech, is often insufficient to capture these complex emotional cues. To address this issue, we propose a multimodal emotion recognition framework for the Expression (EXPR) Recognition task in the 10th Affective Behavior Analysis in-the-wild (ABAW) Challenge. Our approach leverages large-scale pre-trained models, namely CLIP for visual encoding and Wav2Vec 2.0 for audio representation learning, as frozen backbone networks. To model temporal dependencies in facial expression sequences, we employ a Temporal Convolutional Network (TCN) over fixed-length video windows. In addition, we introduce a bi-directional cross-attention fusion module, in which visual and audio features interact symmetrically to enhance cross-modal contextualization and capture complementary emotional information. A lightweight classification head is then used for final emotion prediction. We further incorporate a text-guided contrastive objective based on CLIP text features to encourage semantically aligned visual representations. Experimental results on the ABAW 10th EXPR benchmark show that the proposed framework provides a strong multimodal baseline and achieves improved performance over unimodal modeling. These results demonstrate the effectiveness of combining temporal visual modeling, audio representation learning, and cross-modal fusion for robust emotion recognition in unconstrained real-world environments.
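
The text-guided contrastive objective mentioned in the abstract builds on CLIP-style image-text alignment. As a hedged sketch (the paper's exact loss may differ), the following dependency-free code implements a symmetric InfoNCE loss over matched visual/class-text feature pairs; the function names and the temperature value are illustrative assumptions:

```python
import math

def l2_normalize(v):
    # Unit-normalize a feature vector (CLIP-style cosine similarity).
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def clip_style_contrastive_loss(visual_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE over matched pairs: the i-th visual feature should
    align with the i-th text feature (e.g. a CLIP embedding of the class name).
    Sketch only; temperature and pairing scheme are assumptions."""
    v = [l2_normalize(x) for x in visual_feats]
    t = [l2_normalize(x) for x in text_feats]
    # Pairwise similarity logits, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
              for vi in v]

    def ce(rows):
        # Cross-entropy with the diagonal (matched pair) as the target.
        loss = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            loss += log_z - row[i]
        return loss / len(rows)

    logits_t = [list(col) for col in zip(*logits)]   # text-to-visual direction
    return 0.5 * (ce(logits) + ce(logits_t))
```

Under this objective, correctly matched visual/text pairs yield a lower loss than mismatched ones, pulling visual representations toward the semantics of the class-name text embeddings.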