Multimodal Emotion Recognition via Bi-directional Cross-Attention and Temporal Modeling
arXiv cs.CV / 3/13/2026
Key Points
- The paper proposes a multimodal emotion recognition framework for the ABAW EXPR task that uses CLIP for visual encoding and Wav2Vec 2.0 for audio, with a Temporal Convolutional Network to capture temporal dynamics.
- It features a bi-directional cross-attention fusion module that enables symmetric interaction between visual and audio features to enhance cross-modal contextualization.
- It introduces a text-guided contrastive objective based on CLIP text features to promote semantically aligned visual representations.
- Experimental results on the EXPR benchmark of the 10th ABAW challenge show that the proposed framework provides a strong multimodal baseline and outperforms unimodal models, highlighting the benefit of combining temporal visual modeling, audio representations, and cross-modal fusion in real-world settings.
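
The core fusion idea described above — each modality's features querying the other's — can be illustrated with a minimal sketch of bi-directional cross-attention. This is not the paper's implementation; the single-head formulation, feature dimensions, and sequence lengths are all assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key, value):
    # Scaled dot-product attention: queries come from one modality,
    # keys/values from the other.
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)   # (T_q, T_k) similarity scores
    weights = softmax(scores, axis=-1)    # each query row sums to 1
    return weights @ value                # (T_q, d) contextualized features

def bidirectional_fusion(vis, aud):
    # Symmetric interaction: visual features attend over audio and
    # vice versa; residual connections preserve each modality's own
    # information alongside the cross-modal context.
    vis_ctx = vis + cross_attention(vis, aud, aud)
    aud_ctx = aud + cross_attention(aud, vis, vis)
    return vis_ctx, aud_ctx

rng = np.random.default_rng(0)
vis = rng.standard_normal((16, 64))  # e.g. 16 video frames, 64-dim features
aud = rng.standard_normal((50, 64))  # e.g. 50 audio frames, 64-dim features
v_out, a_out = bidirectional_fusion(vis, aud)
print(v_out.shape, a_out.shape)  # (16, 64) (50, 64)
```

Note that the two attention directions leave each sequence length unchanged, so the fused visual and audio streams can still be fed to per-modality temporal models (such as the TCN) downstream.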