Solution for 10th Competition on Ambivalence/Hesitancy (AH) Video Recognition Challenge using Divergence-Based Multimodal Fusion
arXiv cs.CV / 3/19/2026
💬 Opinion · Models & Research
Key Points
- The paper proposes a divergence-based multimodal fusion method for Ambivalence/Hesitancy (A/H) video recognition in the ABAW CVPR 2026 competition, explicitly modeling cross-modal conflict among the visual, audio, and text streams.
- Visual features are encoded as Action Units (AUs) via Py-Feat, audio via Wav2Vec 2.0, and text via BERT, with each modality processed by a BiLSTM and attention pooling to produce a shared embedding.
- The fusion module uses pairwise absolute differences between modality embeddings to capture cross-modal incongruence that characterizes A/H.
- On the BAH dataset, the method achieves a Macro F1 of 0.6808 on the validation set, outperforming the baseline of 0.2827.
- Statistical analysis across 1,132 videos identifies temporal variability of AUs as the dominant visual discriminator of Ambivalence/Hesitancy.
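The fusion step described above can be sketched compactly. The following is a minimal illustration, not the paper's implementation: it assumes each modality has already been pooled (via BiLSTM and attention) into a shared d-dimensional embedding, and that the fused representation concatenates the modality embeddings with their pairwise absolute differences — the exact concatenation scheme is an assumption.

```python
import numpy as np

def divergence_fusion(visual, audio, text):
    """Hypothetical sketch of divergence-based fusion.

    Each input is a pooled embedding of shape (d,) for one modality.
    Pairwise absolute differences capture cross-modal incongruence,
    which the paper argues characterizes Ambivalence/Hesitancy.
    Returns a fused vector of shape (6 * d,): three embeddings plus
    three pairwise difference terms (concatenation scheme assumed).
    """
    diffs = [
        np.abs(visual - audio),  # visual-audio disagreement
        np.abs(visual - text),   # visual-text disagreement
        np.abs(audio - text),    # audio-text disagreement
    ]
    return np.concatenate([visual, audio, text] + diffs)

# Example with toy 4-dimensional embeddings:
v = np.array([0.2, 0.8, 0.1, 0.5])
a = np.array([0.9, 0.1, 0.4, 0.5])
t = np.array([0.3, 0.7, 0.2, 0.6])
fused = divergence_fusion(v, a, t)  # shape (24,)
```

The absolute difference is symmetric and zero when modalities agree, so large values in the difference terms directly signal the cross-modal conflict that a downstream classifier can attend to.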