Team LEYA in 10th ABAW Competition: Multimodal Ambivalence/Hesitancy Recognition Approach
arXiv cs.AI / 3/16/2026
Key Points
- The paper proposes a multimodal approach for video-level ambivalence/hesitancy recognition that integrates scene, face, audio, and text information.
- It employs VideoMAE for scene dynamics, emotion-based face embeddings with statistical pooling, EmotionWav2Vec2.0 with a Mamba temporal encoder for audio, and fine-tuned transformer models for text, followed by prototype-augmented multimodal fusion (sketched after this list).
- On the BAH corpus, multimodal fusion outperforms unimodal baselines: the best fusion model achieves an average MF1 (macro F1) of 83.25%, and an ensemble of prototype-augmented models reaches 71.43% on the final test set (a scoring sketch follows below).
- The results underscore the importance of combining multiple cues and robust fusion strategies for accurate ambivalence/hesitancy recognition in unconstrained videos.
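To make the second bullet concrete, here is one way the four modality encoders and the prototype-augmented fusion head could fit together. This is a minimal PyTorch sketch, assuming each per-modality encoder has already produced a fixed-size embedding; the embedding dimensions and the concatenation-plus-prototype-cosine-similarity design are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeFusion(nn.Module):
    """Illustrative prototype-augmented fusion over four modality embeddings.

    Assumptions (not from the paper): each encoder yields one fixed-size
    embedding per video; prototypes are one learnable vector per class in
    the fused space, and their cosine similarities are appended to the
    fused feature before classification.
    """

    def __init__(self, dims=(768, 512, 768, 768), fused_dim=512, num_classes=2):
        super().__init__()
        # Per-modality projections: scene (VideoMAE), face statistics,
        # audio (EmotionWav2Vec2.0 + Mamba), text (transformer).
        self.proj = nn.ModuleList([nn.Linear(d, fused_dim) for d in dims])
        # One learnable prototype per class in the fused space.
        self.prototypes = nn.Parameter(torch.randn(num_classes, fused_dim * 4))
        self.classifier = nn.Linear(fused_dim * 4 + num_classes, num_classes)

    def forward(self, scene, face, audio, text):
        # Project each modality embedding and concatenate.
        fused = torch.cat(
            [p(x) for p, x in zip(self.proj, (scene, face, audio, text))], dim=-1
        )
        # Cosine similarity of the fused vector to each class prototype.
        sims = F.cosine_similarity(
            fused.unsqueeze(1), self.prototypes.unsqueeze(0), dim=-1
        )
        # Augment the fused feature with prototype similarities, then classify.
        return self.classifier(torch.cat([fused, sims], dim=-1))

# Example: batch of 4 videos, one embedding per modality.
model = PrototypeFusion()
logits = model(torch.randn(4, 768), torch.randn(4, 512),
               torch.randn(4, 768), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 2])
```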
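On the metric side, MF1 is the F1 score computed per class and then averaged with equal class weight. The snippet below shows how such a score could be computed for a probability-averaging ensemble; the averaging scheme and all numbers are hypothetical, since this summary does not describe the paper's exact ensembling recipe.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical per-model class probabilities for 6 videos (rows) x 2 classes.
probs_model_a = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4],
                          [0.3, 0.7], [0.8, 0.2], [0.4, 0.6]])
probs_model_b = np.array([[0.7, 0.3], [0.4, 0.6], [0.5, 0.5],
                          [0.1, 0.9], [0.9, 0.1], [0.3, 0.7]])
labels = np.array([0, 1, 0, 1, 0, 1])

# Ensemble by averaging class probabilities, then take the argmax class.
preds = np.mean([probs_model_a, probs_model_b], axis=0).argmax(axis=1)

# MF1: F1 computed per class, then averaged with equal class weight.
print(f"MF1 = {f1_score(labels, preds, average='macro'):.4f}")
```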