
Conflict-Aware Multimodal Fusion for Ambivalence and Hesitancy Recognition

arXiv cs.CV / 3/18/2026


Key Points

  • The paper presents ConflictAwareAH, a multimodal framework for ambivalence and hesitancy recognition that fuses video, audio, and text representations using pairwise cross-modal conflict features.
  • It uses element-wise absolute differences between modality embeddings as bidirectional cues: large discrepancies flag ambivalence/hesitancy, while small differences indicate behavioral consistency and anchor the negative class.
  • It introduces a text-guided late fusion strategy that blends a text-only auxiliary head with the full model at inference, adding about 4.1 points of Macro F1.
  • On the ABAW10 Ambivalence/Hesitancy Challenge's BAH dataset, it achieves 0.694 Macro F1 on the labelled test split and 0.715 on the private leaderboard, outperforming published multimodal baselines by over 10 points.
  • Training takes under 25 minutes on a single GPU.
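
The conflict cue described in the key points can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the shared embedding size, the toy values, the concatenation of the three pairwise differences, and the name `conflict_features` are all assumptions made for illustration.

```python
import numpy as np

def conflict_features(video, audio, text):
    """Concatenate element-wise absolute differences for each modality pair.

    Assumes the three embeddings were already projected to one shared size;
    the summary does not say how the paper aligns or combines them.
    """
    pairs = [(video, audio), (video, text), (audio, text)]
    return np.concatenate([np.abs(a - b) for a, b in pairs], axis=-1)

# Hypothetical 4-dimensional embeddings (toy values, not from the paper).
video = np.array([0.5, -1.0, 0.0, 2.0])
audio = np.array([0.5, -1.0, 0.0, 2.0])   # agrees with video
text = np.array([-0.5, 1.0, 0.0, 0.0])    # disagrees with video and audio

feats = conflict_features(video, audio, text)
print(feats.shape)   # 3 pairs x 4 dimensions
print(feats[:4])     # video-audio block: all zeros, i.e. consistent behaviour
print(feats[4:8])    # video-text block: large values, i.e. cross-modal conflict
```

In this toy case the video-audio block is zero, which is the "consistency" signal, while the two text blocks are large, which is the "conflict" signal. A downstream classifier would see both.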

Abstract

Ambivalence and hesitancy (A/H) are subtle affective states in which a person shows conflicting signals through different channels: saying one thing while their face or voice tells another story. Recognising these states automatically is valuable in clinical settings, but it is hard for machines because the key evidence lives in the disagreements between what is said, how it sounds, and what the face shows. We present ConflictAwareAH, a multimodal framework built for this problem. Three pre-trained encoders extract video, audio, and text representations. Pairwise conflict features (element-wise absolute differences between modality embeddings) serve as bidirectional cues: large cross-modal differences flag A/H, while small differences confirm behavioural consistency and anchor the negative class. This conflict-aware design addresses a key limitation of text-dominant approaches, which tend to over-detect A/H (high F1-AH) while struggling to confirm its absence: our multimodal model improves F1-NoAH by +4.6 points over text alone and halves the class-performance gap. A complementary text-guided late fusion strategy blends a text-only auxiliary head with the full model at inference, adding +4.1 Macro F1. On the BAH dataset from the ABAW10 Ambivalence/Hesitancy Challenge, our method reaches 0.694 Macro F1 on the labelled test split and 0.715 on the private leaderboard, outperforming published multimodal baselines by over 10 points, all on a single GPU in under 25 minutes of training.
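
The abstract says only that the text-only auxiliary head is blended with the full model at inference. One plausible reading, sketched below, is a convex combination of the two heads' A/H probabilities. The blending rule, the weight `alpha`, the sigmoid outputs, and the 0.5 threshold are all assumptions for illustration, not details confirmed by the source.

```python
import numpy as np

def sigmoid(x):
    """Map logits to probabilities."""
    return 1.0 / (1.0 + np.exp(-x))

def late_fusion(full_logits, text_logits, alpha=0.5):
    """Blend full-model and text-only A/H probabilities.

    alpha is a hypothetical blending weight: 0 keeps only the full
    multimodal model, 1 keeps only the text-only auxiliary head.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    p_full = sigmoid(np.asarray(full_logits, dtype=float))
    p_text = sigmoid(np.asarray(text_logits, dtype=float))
    return alpha * p_text + (1.0 - alpha) * p_full

# Toy logits for two clips (values invented for the example).
full_logits = [0.0, 2.0]    # full model: unsure, then fairly confident A/H
text_logits = [0.0, -2.0]   # text head: unsure, then fairly confident no A/H

blended = late_fusion(full_logits, text_logits, alpha=0.5)
predictions = blended >= 0.5   # hypothetical decision threshold
print(blended, predictions)
```

With equal weights, the second clip's opposing logits cancel out to a probability of 0.5. In practice a weight like `alpha` would be tuned on a validation split.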