Aligning Paralinguistic Understanding and Generation in Speech LLMs via Multi-Task Reinforcement Learning
arXiv cs.CL / 3/18/2026
📰 News · Signals & Early Trends · Models & Research
Key Points
- The paper addresses two challenges in leveraging paralinguistic cues (prosody, emotion, and non-verbal sounds) in speech LLMs: limited training data with costly annotation, and models' tendency to exploit lexical shortcuts rather than attend to paralinguistic signals.
- It introduces PALLM, a paralinguistics-aware speech LLM trained with multi-task reinforcement learning and chain-of-thought prompting to elicit explicit affective reasoning; a two-stage pipeline jointly optimizes sentiment classification from audio and paralinguistics-aware response generation (see the sketch after this list).
- Experiments on Expresso, IEMOCAP, and RAVDESS show 8-12% improvements over supervised baselines and strong proprietary models (Gemini-2.5-Pro, GPT-4o-audio), highlighting the value of modeling paralinguistic reasoning.
- The results suggest that multi-task RL with explicit affective reasoning is a promising direction for building emotionally intelligent speech AI systems.
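The summary does not give the paper's exact reward formulation, but a multi-task RL objective of this kind is typically a weighted combination of per-task rewards. Below is a minimal Python sketch under that assumption; the weights, the `Rollout` fields, and the judge-style `response_style_score` are hypothetical illustrations, not the authors' implementation.

```python
# Minimal sketch of a multi-task RL reward in the spirit of the paper's
# two-stage pipeline: one reward term for sentiment classification from
# audio, one for paralinguistics-aware response generation.
# All names and weights here are assumptions for illustration.

from dataclasses import dataclass


@dataclass
class Rollout:
    predicted_sentiment: str      # label parsed from the model's chain-of-thought output
    gold_sentiment: str           # annotated sentiment for the audio clip
    response_style_score: float   # in [0, 1], e.g. a judge model's rating of paralinguistic fit


def multitask_reward(r: Rollout, w_cls: float = 0.5, w_gen: float = 0.5) -> float:
    """Combine a classification reward and a generation reward into one scalar."""
    r_cls = 1.0 if r.predicted_sentiment == r.gold_sentiment else 0.0
    r_gen = r.response_style_score
    return w_cls * r_cls + w_gen * r_gen


# Example: correct sentiment label, moderately well-matched response style.
print(multitask_reward(Rollout("angry", "angry", 0.7)))  # -> 0.85
```

In this formulation the two task rewards share one policy update, which is what lets the generation task benefit from the affective signal learned on the classification task.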