Team RAS in 10th ABAW Competition: Multimodal Valence and Arousal Estimation Approach

arXiv cs.CV / 3/16/2026

Key Points

  • The paper presents a multimodal method for valence-arousal estimation in-the-wild by fusing face, behavior, and audio modalities.
  • The face stream uses GRADA-based frame-level embeddings and Transformer-based temporal regression to capture facial dynamics.
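  • The behavior stream uses Qwen3-VL-4B-Instruct to extract behavior-relevant information from video segments, with Mamba modeling temporal dynamics across segments; the audio stream uses WavLM-Large with attention-statistics pooling and a cross-modal filtering stage to reduce the influence of unreliable or non-speech segments.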
  • The authors compare two fusion strategies—Directed Cross-Modal Mixture-of-Experts Fusion and Reliability-Aware Audio-Visual Fusion—and report a Concordance Correlation Coefficient (CCC) of 0.658 on the Aff-Wild2 development set under the ABAW protocol.
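
For readers unfamiliar with the metric, the Concordance Correlation Coefficient can be sketched in a few lines. This is a minimal illustration of the standard CCC definition, not the authors' evaluation code. The function names `ccc` and `mean_va_ccc` are hypothetical, and averaging the valence and arousal scores is an assumption about how a single number might be obtained; the summary does not say how the reported 0.658 was aggregated.

```python
def ccc(pred, gold):
    """Concordance Correlation Coefficient between two equal-length sequences.

    CCC = 2 * cov(pred, gold) / (var(pred) + var(gold) + (mean(pred) - mean(gold))**2)
    Population (1/n) statistics are used throughout.
    """
    if len(pred) != len(gold) or len(pred) == 0:
        raise ValueError("pred and gold must be non-empty and of equal length")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0:
        # Both sequences are constant and identical: CCC is undefined.
        raise ValueError("CCC is undefined for two identical constant sequences")
    return 2.0 * cov / denom


def mean_va_ccc(pred_v, gold_v, pred_a, gold_a):
    """Assumed aggregate: the plain average of valence CCC and arousal CCC."""
    return 0.5 * (ccc(pred_v, gold_v) + ccc(pred_a, gold_a))


if __name__ == "__main__":
    # Perfect agreement gives 1.0; a constant offset is penalized even
    # though the Pearson correlation would still be 1.0.
    print(ccc([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
    print(ccc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))  # 4/7, about 0.571
```

Unlike Pearson correlation, CCC penalizes shifts in mean and scale, which is why it is a common choice for continuous valence and arousal prediction.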

Abstract

Continuous emotion recognition in terms of valence and arousal under in-the-wild (ITW) conditions remains a challenging problem due to large variations in appearance, head pose, illumination, occlusions, and subject-specific patterns of affective expression. We present a multimodal method for valence-arousal estimation ITW. Our method combines three complementary modalities: face, behavior, and audio. The face modality relies on GRADA-based frame-level embeddings and Transformer-based temporal regression. We use Qwen3-VL-4B-Instruct to extract behavior-relevant information from video segments, while Mamba is used to model temporal dynamics across segments. The audio modality relies on WavLM-Large with attention-statistics pooling and includes a cross-modal filtering stage to reduce the influence of unreliable or non-speech segments. To fuse modalities, we explore two fusion strategies: a Directed Cross-Modal Mixture-of-Experts Fusion Strategy that learns interactions between modalities with adaptive weighting, and a Reliability-Aware Audio-Visual Fusion Strategy that combines visual features at the frame-level while using audio as complementary context. The results are reported on the Aff-Wild2 dataset following the 10th Affective Behavior Analysis in-the-Wild (ABAW) challenge protocol. Experiments demonstrate that the proposed multimodal fusion strategy achieves a Concordance Correlation Coefficient (CCC) of 0.658 on the Aff-Wild2 development set.
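
The attention-statistics pooling mentioned for the audio stream can be illustrated with a small sketch. This shows the general technique (an attention-weighted mean and standard deviation over frames, concatenated into one vector), not the authors' implementation. The function name `attentive_stats_pooling` is hypothetical, and the per-frame attention scores are taken as an input here; in a real model they would come from a small learned scoring network over the WavLM-Large frame features.

```python
import math


def attentive_stats_pooling(frames, scores):
    """Pool T frame vectors (each of length D) into one vector of length 2*D.

    frames: list of T lists of floats (frame-level features).
    scores: list of T unnormalized attention scores (assumed to be given;
            normally produced by a learned layer).
    Returns [weighted mean (D values)] + [weighted std (D values)].
    """
    if not frames or len(frames) != len(scores):
        raise ValueError("frames and scores must be non-empty and of equal length")

    # Numerically stable softmax over time to obtain attention weights.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(frames[0])
    mean = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    var = [
        sum(w * (f[d] - mean[d]) ** 2 for w, f in zip(weights, frames))
        for d in range(dim)
    ]
    std = [math.sqrt(max(v, 0.0)) for v in var]
    return mean + std


if __name__ == "__main__":
    # Equal scores reduce to the ordinary mean and standard deviation.
    print(attentive_stats_pooling([[0.0, 2.0], [2.0, 4.0]], [0.0, 0.0]))
```

Frames with low attention scores contribute little to either statistic, so the pooled vector is dominated by the frames the scoring network considers informative.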