
Team LEYA in 10th ABAW Competition: Multimodal Ambivalence/Hesitancy Recognition Approach

arXiv cs.AI / 3/16/2026

📰 News · Models & Research

Key Points

  • The paper proposes a multimodal approach for video-level ambivalence/hesitancy recognition that integrates scene, face, audio, and text information.
  • It employs VideoMAE for scene dynamics, emotion-based face embeddings aggregated by statistical pooling (illustrated in the sketch after this list), EmotionWav2Vec2.0 with a Mamba temporal encoder for audio, and fine-tuned transformer models for text, followed by prototype-augmented multimodal fusion.
  • On the BAH corpus, multimodal fusion outperforms all unimodal baselines: the best fusion model achieves an average MF1 of 83.25%, and an ensemble of five prototype-augmented fusion models reaches 71.43% on the final test set.
  • The results underscore the importance of combining multiple cues and robust fusion strategies for accurate ambivalence/hesitancy recognition in unconstrained videos.
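
A minimal sketch of the statistical pooling step, assuming the statistics are per-dimension mean and standard deviation over frames; the paper only states that frame-level emotional face embeddings are aggregated by statistical pooling, and the 512-dimensional embedding size here is a placeholder:

```python
import torch

def statistical_pooling(frame_embeddings: torch.Tensor) -> torch.Tensor:
    """Aggregate frame-level face embeddings of shape (T, D) into a single
    video-level vector by concatenating the per-dimension mean and standard
    deviation. The mean/std choice is an assumption for illustration.
    """
    mean = frame_embeddings.mean(dim=0)   # (D,)
    std = frame_embeddings.std(dim=0)     # (D,)
    return torch.cat([mean, std], dim=0)  # (2 * D,)

# Example: 120 frames of hypothetical 512-dimensional face embeddings.
video_vector = statistical_pooling(torch.randn(120, 512))
print(video_vector.shape)  # torch.Size([1024])
```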

Abstract

Ambivalence/hesitancy recognition in unconstrained videos is a challenging problem due to the subtle, multimodal, and context-dependent nature of this behavioral state. In this paper, a multimodal approach for video-level ambivalence/hesitancy recognition is presented for the 10th ABAW Competition. The proposed approach integrates four complementary modalities: scene, face, audio, and text. Scene dynamics are captured with a VideoMAE-based model, facial information is encoded through emotional frame-level embeddings aggregated by statistical pooling, acoustic representations are extracted with EmotionWav2Vec2.0 and processed by a Mamba-based temporal encoder, and linguistic cues are modeled using fine-tuned transformer-based text models. The resulting unimodal embeddings are further combined using multimodal fusion models, including prototype-augmented variants. Experiments on the BAH corpus demonstrate clear gains of multimodal fusion over all unimodal baselines. The best unimodal configuration achieved an average MF1 of 70.02%, whereas the best multimodal fusion model reached 83.25%. The highest final test performance, 71.43%, was obtained by an ensemble of five prototype-augmented fusion models. The obtained results highlight the importance of complementary multimodal cues and robust fusion strategies for ambivalence/hesitancy recognition.
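
The abstract does not detail how prototypes enter the fusion stage. Below is a minimal sketch, assuming the four video-level unimodal embeddings are concatenated, projected to a shared space, compared against learnable per-class prototypes by cosine similarity, and the similarity scores are appended to the fused representation before the classifier. All dimensions and layer choices are illustrative placeholders, not the authors' configuration.

```python
import torch
import torch.nn as nn

class PrototypeAugmentedFusion(nn.Module):
    """Sketch of a prototype-augmented multimodal fusion head.

    Assumptions (not specified in the abstract): unimodal embeddings are
    concatenated, projected to a shared space, compared to one learnable
    prototype per class via cosine similarity, and the similarities are
    concatenated to the fused features before a binary classifier.
    """

    def __init__(self, dims=(768, 1024, 768, 768), hidden=256, num_classes=2):
        super().__init__()
        self.project = nn.Sequential(
            nn.Linear(sum(dims), hidden), nn.ReLU(), nn.Dropout(0.3)
        )
        # One learnable prototype vector per class in the shared space.
        self.prototypes = nn.Parameter(torch.randn(num_classes, hidden))
        self.classifier = nn.Linear(hidden + num_classes, num_classes)

    def forward(self, scene, face, audio, text):
        fused = self.project(torch.cat([scene, face, audio, text], dim=-1))
        # Cosine similarity between the fused embedding and each prototype.
        sims = nn.functional.cosine_similarity(
            fused.unsqueeze(1), self.prototypes.unsqueeze(0), dim=-1
        )
        return self.classifier(torch.cat([fused, sims], dim=-1))

# Hypothetical embedding sizes for scene (VideoMAE), face (statistical
# pooling), audio (Mamba encoder), and text (transformer) features.
model = PrototypeAugmentedFusion()
logits = model(torch.randn(4, 768), torch.randn(4, 1024),
               torch.randn(4, 768), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 2])
```

The reported ensemble of five prototype-augmented fusion models would typically combine independently trained instances of such a head, for example by averaging their predicted class probabilities; the abstract does not specify the combination rule.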