Multimodal Emotion Recognition via Bi-directional Cross-Attention and Temporal Modeling

arXiv cs.CV / 3/13/2026

Key Points

  • The paper proposes a multimodal emotion recognition framework for the ABAW EXPR task that uses CLIP for visual encoding and Wav2Vec 2.0 for audio, with a Temporal Convolutional Network to capture temporal dynamics.
  • It features a bi-directional cross-attention fusion module that enables symmetric interaction between visual and audio features to enhance cross-modal contextualization.
  • It introduces a text-guided contrastive objective based on CLIP text features to promote semantically aligned visual representations.
  • Experimental results on the 10th ABAW EXPR benchmark show that the proposed framework provides a strong multimodal baseline and improves over unimodal models, highlighting the benefit of combining temporal visual modeling, audio representation learning, and cross-modal fusion in real-world settings.
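
To make the fusion idea above concrete, here is a minimal, dependency-free sketch of bi-directional cross-attention: scaled dot-product attention applied in both directions (visual queries over audio, and audio queries over visual), with a residual sum as a simple fusion rule. This is an illustration only, not the paper's implementation; all function names, the residual fusion, and the assumption that both modalities are already projected to a shared feature dimension are ours:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over the keys/values
    of the *other* modality. Shapes: queries [T_q][d], keys/values [T_k][d]."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def bidirectional_fusion(visual, audio):
    """Symmetric cross-modal interaction (assumed fusion rule): visual attends
    to audio and audio attends to visual; each stream keeps a residual sum.
    Assumes both streams share the same feature dimension."""
    v2a = cross_attention(visual, audio, audio)    # visual queries, audio keys/values
    a2v = cross_attention(audio, visual, visual)   # audio queries, visual keys/values
    fused_v = [[x + y for x, y in zip(v, a)] for v, a in zip(visual, v2a)]
    fused_a = [[x + y for x, y in zip(a, v)] for a, v in zip(audio, a2v)]
    return fused_v, fused_a
```

Each output stream keeps its own temporal length (one fused vector per frame or audio step), which is why a lightweight classification head can be attached directly afterward.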

Abstract

Emotion recognition in in-the-wild video data remains a challenging problem due to large variations in facial appearance, head pose, illumination, background noise, and the inherently dynamic nature of human affect. Relying on a single modality, such as facial expressions or speech, is often insufficient to capture these complex emotional cues. To address this issue, we propose a multimodal emotion recognition framework for the Expression (EXPR) Recognition task in the 10th Affective Behavior Analysis in-the-wild (ABAW) Challenge. Our approach leverages large-scale pre-trained models, namely CLIP for visual encoding and Wav2Vec 2.0 for audio representation learning, as frozen backbone networks. To model temporal dependencies in facial expression sequences, we employ a Temporal Convolutional Network (TCN) over fixed-length video windows. In addition, we introduce a bi-directional cross-attention fusion module, in which visual and audio features interact symmetrically to enhance cross-modal contextualization and capture complementary emotional information. A lightweight classification head is then used for final emotion prediction. We further incorporate a text-guided contrastive objective based on CLIP text features to encourage semantically aligned visual representations. Experimental results on the ABAW 10th EXPR benchmark show that the proposed framework provides a strong multimodal baseline and achieves improved performance over unimodal modeling. These results demonstrate the effectiveness of combining temporal visual modeling, audio representation learning, and cross-modal fusion for robust emotion recognition in unconstrained real-world environments.
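
The text-guided contrastive objective mentioned in the abstract builds on CLIP-style image-text alignment. As a hedged sketch (the paper's exact loss may differ), the following dependency-free code implements a symmetric InfoNCE loss over matched visual/class-text feature pairs; the function names and the temperature value are illustrative assumptions:

```python
import math

def l2_normalize(v):
    # Unit-normalize a feature vector (CLIP-style cosine similarity).
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def clip_style_contrastive_loss(visual_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE over matched pairs: the i-th visual feature should
    align with the i-th text feature (e.g. a CLIP embedding of the class name).
    Sketch only; temperature and pairing scheme are assumptions."""
    v = [l2_normalize(x) for x in visual_feats]
    t = [l2_normalize(x) for x in text_feats]
    # Pairwise similarity logits, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
              for vi in v]

    def ce(rows):
        # Cross-entropy with the diagonal (matched pair) as the target.
        loss = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            loss += log_z - row[i]
        return loss / len(rows)

    logits_t = [list(col) for col in zip(*logits)]   # text-to-visual direction
    return 0.5 * (ce(logits) + ce(logits_t))
```

Under this objective, correctly matched visual/text pairs yield a lower loss than mismatched ones, pulling visual representations toward the semantics of the class-name text embeddings.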