Team RAS in 10th ABAW Competition: Multimodal Valence and Arousal Estimation Approach
arXiv cs.CV / 3/16/2026
💬 Opinion · Models & Research
Key Points
- The paper presents a multimodal method for valence-arousal estimation in the wild, fusing face, behavior, and audio modalities.
- The face stream uses GRADA-based frame-level embeddings and Transformer-based temporal regression to capture facial dynamics.
- The behavior stream is built on Qwen3-VL-4B-Instruct and the audio stream on WavLM-Large with cross-modal filtering; Mamba models temporal dynamics across segments.
- The authors compare two fusion strategies, Directed Cross-Modal Mixture-of-Experts Fusion and Reliability-Aware Audio-Visual Fusion, and report a Concordance Correlation Coefficient (CCC) of 0.658 on the Aff-Wild2 development set under the ABAW protocol.
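The CCC score reported above is the standard ABAW valence-arousal metric, which penalizes both low correlation and mean/scale mismatch between predictions and labels. A minimal sketch of the metric itself (Lin's concordance correlation, not the authors' code):

```python
import numpy as np

def ccc(preds, labels):
    """Concordance Correlation Coefficient (Lin, 1989).

    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2)
    Equals 1 only for perfect agreement; drops if predictions are
    correlated but shifted or rescaled relative to the labels.
    """
    preds = np.asarray(preds, dtype=float)
    labels = np.asarray(labels, dtype=float)
    mean_p, mean_l = preds.mean(), labels.mean()
    var_p, var_l = preds.var(), labels.var()  # population variance
    cov = ((preds - mean_p) * (labels - mean_l)).mean()
    return 2 * cov / (var_p + var_l + (mean_p - mean_l) ** 2)
```

Unlike plain Pearson correlation, a constant bias in the predictions lowers CCC, which is why it is preferred for continuous valence/arousal regression.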