Team RAS in 10th ABAW Competition: Multimodal Valence and Arousal Estimation Approach
arXiv cs.CV / 3/16/2026
Models & Research
Key Points
- The paper presents a multimodal method for valence-arousal estimation in the wild, fusing face, behavior, and audio modalities.
- The face stream uses GRADA-based frame-level embeddings with Transformer-based temporal regression to capture facial dynamics (see the sketch after this list).
- The behavior stream is built on Qwen3-VL-4B-Instruct and the audio stream on WavLM-Large with cross-modal filtering; Mamba models temporal dynamics across segments.
- The authors compare two fusion strategies, Directed Cross-Modal Mixture-of-Experts Fusion and Reliability-Aware Audio-Visual Fusion, and report a Concordance Correlation Coefficient (CCC) of 0.658 on the Aff-Wild2 development set under the ABAW protocol (the CCC metric is sketched below).
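To make the face stream's temporal stage concrete, here is a minimal PyTorch sketch of Transformer-based temporal regression over per-frame embeddings. The class name, embedding dimension, layer count, and the tanh output head are illustrative assumptions, not the paper's actual GRADA features or regressor configuration.

```python
import torch
import torch.nn as nn

class TemporalVARegressor(nn.Module):
    """Hypothetical sketch: a Transformer encoder over frame embeddings,
    followed by a linear head that regresses valence and arousal per frame.
    All dimensions and depths are illustrative, not the paper's settings."""

    def __init__(self, embed_dim: int = 512, n_heads: int = 8, n_layers: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(embed_dim, 2)  # [valence, arousal]

    def forward(self, frame_embeds: torch.Tensor) -> torch.Tensor:
        # frame_embeds: (batch, n_frames, embed_dim) per-frame face features
        h = self.encoder(frame_embeds)    # mix temporal context across frames
        return torch.tanh(self.head(h))   # (batch, n_frames, 2), values in [-1, 1]

# Usage: 16 clips of 64 frames, each frame a 512-d embedding.
model = TemporalVARegressor()
va = model(torch.randn(16, 64, 512))
print(va.shape)  # torch.Size([16, 64, 2])
```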
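The evaluation metric, by contrast, is standard and can be reproduced exactly: CCC combines Pearson correlation with penalties for mean and scale mismatch, which is why it is the usual ABAW score for valence and arousal. A NumPy sketch:

```python
import numpy as np

def ccc(pred: np.ndarray, target: np.ndarray) -> float:
    """Concordance Correlation Coefficient:
    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    mx, my = pred.mean(), target.mean()
    vx, vy = pred.var(), target.var()           # population (biased) variance
    cov = ((pred - mx) * (target - my)).mean()  # population covariance
    return 2.0 * cov / (vx + vy + (mx - my) ** 2)

# Perfect agreement scores 1.0; a constant offset lowers CCC
# even though the Pearson correlation stays at 1.
y = np.linspace(-1, 1, 100)
print(ccc(y, y))        # 1.0
print(ccc(y + 0.2, y))  # < 1.0 despite r = 1
```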
Related Articles
Data Augmentation Using GANs
Dev.to
ItinBench: Benchmarking Planning Across Multiple Cognitive Dimensions with Large Language Models
arXiv cs.AI
Hyperagents
arXiv cs.AI
Teaching an Agent to Sketch One Part at a Time
arXiv cs.AI
PowerLens: Taming LLM Agents for Safe and Personalized Mobile Power Management
arXiv cs.AI