Distance-aware Soft Prompt Learning for Multimodal Valence-Arousal Estimation
arXiv cs.CV / 3/17/2026
Key Points
- The paper introduces Distance-aware Soft Prompt Learning, which bridges the semantic space and the continuous valence-arousal (VA) dimensions for multimodal estimation.
- It partitions the VA space into a 3x3 grid of nine emotional regions and uses a Gaussian kernel to assign soft labels based on distance to region centers, enabling fine-grained emotional transitions rather than hard categories.
- The architecture combines a CLIP image encoder and an Audio Spectrogram Transformer (AST) to extract multimodal features, uses GRUs for temporal modeling, and employs hierarchical fusion with cross-modal attention and gated refinement.
- On the Aff-Wild2 dataset, the approach reports competitive accuracy in unconstrained, in-the-wild scenarios, which the authors take as evidence that the semantic-guided method is effective.
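
The soft-labeling step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the VA range of [-1, 1], the placement of region centers at grid-cell midpoints, the kernel width `sigma`, the normalization to a probability distribution, and the function names `region_centers` and `soft_labels` are all assumptions made for illustration.

```python
import math

def region_centers(lo=-1.0, hi=1.0, n=3):
    """Midpoints of an n x n grid over [lo, hi]^2 (assumed layout, not from the paper)."""
    step = (hi - lo) / n
    ticks = [lo + step * (i + 0.5) for i in range(n)]
    # Row-major order: arousal varies slowest, valence fastest.
    return [(v, a) for a in ticks for v in ticks]

def soft_labels(valence, arousal, sigma=0.5):
    """Gaussian-kernel soft labels over the nine regions.

    sigma is a hypothetical kernel width; the weights are normalized
    to sum to 1 so they can serve as a soft target distribution.
    """
    weights = []
    for cv, ca in region_centers():
        sq_dist = (valence - cv) ** 2 + (arousal - ca) ** 2
        weights.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
    total = sum(weights)
    return [w / total for w in weights]

if __name__ == "__main__":
    # A point near the boundary between two regions spreads its mass across both.
    for w in soft_labels(0.3, 0.0):
        print(f"{w:.3f}")
```

Under these assumptions, a sample at the center of a region puts most of its weight on that region, while a sample near a boundary shares weight between neighbors, which is what allows gradual transitions instead of hard categories.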