Distance-aware Soft Prompt Learning for Multimodal Valence-Arousal Estimation
arXiv cs.CV / 3/17/2026
Key Points
- The paper introduces Distance-aware Soft Prompt Learning, which bridges the semantic space and the continuous valence-arousal (VA) dimensions for multimodal emotion estimation.
- It partitions the VA space into a 3x3 grid of nine emotional regions and uses a Gaussian kernel to assign soft labels based on the distance to each region center, so the labels capture gradual transitions between emotions instead of hard categories.
- The architecture combines a CLIP image encoder with an Audio Spectrogram Transformer (AST) to extract multimodal features, models temporal dynamics with GRUs, and fuses the modalities hierarchically through cross-modal attention and gated refinement.
- On the Aff-Wild2 dataset, the approach achieves competitive results in unconstrained, in-the-wild scenarios, supporting the effectiveness of the semantic-guided method.
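The distance-based soft labeling described in the key points can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that valence and arousal both lie in [-1, 1], that the nine region centers sit at the cell midpoints of a uniform 3x3 partition, and that the weights are normalized to sum to 1. The names `soft_labels`, `REGION_CENTERS`, and the bandwidth `sigma` are hypothetical; the summary does not state the paper's actual centers, bandwidth, or normalization.

```python
import math

# Assumption: valence and arousal both lie in [-1, 1], and the nine region
# centers are the cell midpoints of a uniform 3x3 partition of that square.
GRID_COORDS = (-2.0 / 3.0, 0.0, 2.0 / 3.0)
REGION_CENTERS = [(v, a) for a in GRID_COORDS for v in GRID_COORDS]


def soft_labels(valence, arousal, sigma=0.5):
    """Return nine Gaussian-kernel weights, one per region, summing to 1.

    sigma is a hypothetical bandwidth: smaller values push the result
    toward a hard one-hot label, larger values spread it across regions.
    """
    weights = []
    for center_v, center_a in REGION_CENTERS:
        # Squared Euclidean distance from the sample to this region center.
        sq_dist = (valence - center_v) ** 2 + (arousal - center_a) ** 2
        weights.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
    total = sum(weights)
    # Normalizing is an assumption made here so the weights form a distribution.
    return [w / total for w in weights]


if __name__ == "__main__":
    labels = soft_labels(valence=0.1, arousal=0.6)
    for (center_v, center_a), weight in zip(REGION_CENTERS, labels):
        print(f"center=({center_v:+.2f}, {center_a:+.2f})  weight={weight:.3f}")
```

Under these assumptions, a sample near a cell boundary gives noticeable weight to both neighboring regions, which is the smooth-transition behavior that hard category labels cannot express.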