Solution for 10th Competition on Ambivalence/Hesitancy (AH) Video Recognition Challenge using Divergence-Based Multimodal Fusion
arXiv cs.CV / 3/19/2026
💬 Opinion · Models & Research
Key Points
- The paper proposes a divergence-based multimodal fusion approach for Ambivalence/Hesitancy (A/H) video recognition in the ABAW CVPR 2026 competition, explicitly modeling cross-modal conflict among visual, audio, and text streams.
- Visual features are encoded as Action Units (AUs) via Py-Feat, audio via Wav2Vec 2.0, and text via BERT, with each modality processed by a BiLSTM and attention pooling to produce a shared embedding.
- The fusion module uses pairwise absolute differences between modality embeddings to capture cross-modal incongruence that characterizes A/H.
- On the BAH dataset, the method achieves a Macro F1 of 0.6808 on the validation set, outperforming the baseline of 0.2827.
- Statistical analysis across 1,132 videos identifies temporal variability of AUs as the dominant visual discriminator of Ambivalence/Hesitancy.
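The fusion step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding dimension, the use of NumPy, and the choice to concatenate raw embeddings alongside the three divergence terms are all assumptions.

```python
import numpy as np

def divergence_fusion(v, a, t):
    """Fuse visual, audio, and text embeddings via pairwise absolute
    differences (a sketch of divergence-based fusion).

    v, a, t: arrays of shape (batch, d) -- pooled per-modality embeddings,
    e.g. the outputs of each BiLSTM + attention-pooling branch.
    Returns an array of shape (batch, 6 * d).
    """
    d_va = np.abs(v - a)  # visual vs. audio incongruence
    d_vt = np.abs(v - t)  # visual vs. text incongruence
    d_at = np.abs(a - t)  # audio vs. text incongruence
    # Concatenating raw embeddings with divergence terms lets a downstream
    # classifier see both modality content and cross-modal conflict.
    return np.concatenate([v, a, t, d_va, d_vt, d_at], axis=-1)

# Example: three toy embeddings of dimension 4 for a batch of 2 clips.
v = np.ones((2, 4))
a = np.zeros((2, 4))
t = np.full((2, 4), 2.0)
fused = divergence_fusion(v, a, t)  # shape (2, 24)
```

The absolute difference is a simple, symmetric divergence: it is zero when two modalities agree and grows with their disagreement, which is exactly the cross-modal incongruence signal the paper associates with A/H.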