Aligning Paralinguistic Understanding and Generation in Speech LLMs via Multi-Task Reinforcement Learning
arXiv cs.CL / March 18, 2026
Key Points
- The paper addresses two obstacles to leveraging paralinguistic cues (prosody, emotion, and non-verbal sounds) in speech LLMs: limited training data with difficult annotation, and models exploiting lexical shortcuts instead of attending to paralinguistic signals.
- It introduces a paralinguistics-aware speech LLM (PALLM) trained with multi-task reinforcement learning and chain-of-thought prompting to elicit explicit affective reasoning; a two-stage pipeline jointly optimizes sentiment classification from audio and paralinguistics-aware response generation.
- Experiments report 8-12% gains on Expresso, IEMOCAP, and RAVDESS over supervised baselines and strong proprietary models (Gemini-2.5-Pro, GPT-4o-audio), underscoring the importance of modeling paralinguistic reasoning for emotionally intelligent speech LLMs.
- The results suggest that multi-task RL with explicit affective reasoning is a promising direction for building emotionally intelligent speech AI systems.
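The multi-task setup described above can be illustrated with a minimal sketch: a single scalar reward that blends a sentiment-classification reward with a response-quality reward before a policy update. The function name, weights, and scoring scheme here are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of a multi-task RL reward: combine a
# sentiment-classification reward with a response-generation
# reward into one scalar. Weights and names are assumptions.

def combined_reward(pred_label: str, true_label: str,
                    response_score: float,
                    w_cls: float = 0.5, w_gen: float = 0.5) -> float:
    """Weighted sum of per-task rewards for a joint policy update."""
    r_cls = 1.0 if pred_label == true_label else 0.0   # exact-match classification reward
    r_gen = max(0.0, min(1.0, response_score))          # generation reward clipped to [0, 1]
    return w_cls * r_cls + w_gen * r_gen

# Correct sentiment prediction plus a decent generated response
# yields a reward close to 0.9 with equal weights.
print(combined_reward("happy", "happy", 0.8))
```

In practice the generation reward would come from a learned or rule-based scorer of the response's paralinguistic appropriateness; the weighted sum is simply the standard way to fold multiple task objectives into a single RL signal.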
