AI Navigate

Aligning Paralinguistic Understanding and Generation in Speech LLMs via Multi-Task Reinforcement Learning

arXiv cs.CL / 3/18/2026


Key Points

  • The paper addresses the challenge of leveraging paralinguistic cues (prosody, emotion, and non-verbal sounds) in speech LLMs: training data is limited, annotation is difficult, and models tend to exploit lexical shortcuts instead of paralinguistic signals.
  • It proposes multi-task reinforcement learning with chain-of-thought prompting to elicit explicit affective reasoning, and, to address data scarcity, a paralinguistics-aware speech LLM (PALLM) that jointly optimizes sentiment classification from audio and paralinguistics-aware response generation through a two-stage pipeline.
  • Experiments show 8-12% improvements on Expresso, IEMOCAP, and RAVDESS over supervised baselines and strong proprietary models (Gemini-2.5-Pro, GPT-4o-audio), highlighting the importance of modeling paralinguistic reasoning for emotionally intelligent speech LLMs.
  • The results suggest that multi-task RL with explicit affective reasoning is a promising direction for building emotionally intelligent speech AI systems.
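The joint-optimization idea in the key points can be sketched as a combined reward over the two tasks. This is a hypothetical illustration only: the helper functions, the cue-matching heuristic, and the weight `alpha` are assumptions, not the paper's actual reward design.

```python
# Hypothetical sketch: a multi-task RL reward that jointly scores sentiment
# classification from audio and paralinguistics-aware response generation.
# The weighted-sum combination and the keyword-based response score are
# illustrative stand-ins, not the paper's method.

def sentiment_reward(predicted: str, gold: str) -> float:
    """1.0 if the predicted sentiment label matches the gold label, else 0.0."""
    return 1.0 if predicted == gold else 0.0

def response_reward(response: str, expected_cues: set[str]) -> float:
    """Fraction of expected paralinguistic cue words the response mentions
    (a toy stand-in for whatever learned reward model the paper uses)."""
    if not expected_cues:
        return 0.0
    hits = sum(1 for cue in expected_cues if cue in response.lower())
    return hits / len(expected_cues)

def multi_task_reward(predicted: str, gold: str,
                      response: str, expected_cues: set[str],
                      alpha: float = 0.5) -> float:
    """Weighted sum of the two task rewards; alpha is a hypothetical weight."""
    return (alpha * sentiment_reward(predicted, gold)
            + (1 - alpha) * response_reward(response, expected_cues))

print(multi_task_reward("sad", "sad",
                        "You sound upset and tired.", {"upset", "tired"}))
# → 1.0
```

A single scalar like this lets a standard policy-gradient RL loop optimize both tasks at once, which is the shape of the multi-task setup the summary describes.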

Abstract

Speech large language models (LLMs) observe paralinguistic cues such as prosody, emotion, and non-verbal sounds, which are crucial for intent understanding. However, leveraging these cues faces challenges: limited training data, annotation difficulty, and models exploiting lexical shortcuts over paralinguistic signals. We propose multi-task reinforcement learning (RL) with chain-of-thought prompting that elicits explicit affective reasoning. To address data scarcity, we introduce a paralinguistics-aware speech LLM (PALLM) that jointly optimizes sentiment classification from audio and paralinguistics-aware response generation via a two-stage pipeline. Experiments demonstrate that our approach improves paralinguistic understanding over both supervised baselines and strong proprietary models (Gemini-2.5-Pro, GPT-4o-audio) by 8-12% on Expresso, IEMOCAP, and RAVDESS. The results show that modeling paralinguistic reasoning with multi-task RL is crucial for building emotionally intelligent speech LLMs.
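The abstract's chain-of-thought prompting for "explicit affective reasoning" can be illustrated with a prompt template. The wording below is entirely an assumption for illustration; the summary does not give the paper's actual prompt.

```python
# Illustrative only: a chain-of-thought prompt format that asks a speech LLM
# to reason about paralinguistic cues before producing its answers. The exact
# prompt used in the paper is not specified in this summary.

def build_cot_prompt(transcript: str) -> str:
    """Build a hypothetical CoT prompt around a speech clip's transcript."""
    return (
        "You are given a speech clip with transcript:\n"
        f'  "{transcript}"\n'
        "First, reason step by step about the speaker's prosody, emotion, "
        "and any non-verbal sounds (explicit affective reasoning).\n"
        "Then output:\n"
        "  Sentiment: <label>\n"
        "  Response: <paralinguistics-aware reply>\n"
    )

prompt = build_cot_prompt("I guess it's fine.")
print(prompt)
```

Making the affective reasoning an explicit, rewardable step is what lets the RL stage discourage the lexical shortcuts the abstract warns about.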