AI Navigate

Aligning Paralinguistic Understanding and Generation in Speech LLMs via Multi-Task Reinforcement Learning

arXiv cs.CL / 3/18/2026


Key Points

  • The paper addresses the challenge of leveraging paralinguistic cues (prosody, emotion, and non-verbal sounds) in speech LLMs: training data is limited, annotation is difficult, and models tend to exploit lexical shortcuts instead of paralinguistic signals.
  • It proposes multi-task reinforcement learning with chain-of-thought prompting to elicit explicit affective reasoning, and, to address data scarcity, a paralinguistics-aware speech LLM (PALLM) that jointly optimizes sentiment classification from audio and paralinguistics-aware response generation through a two-stage pipeline.
  • Experiments show 8-12% improvements on Expresso, IEMOCAP, and RAVDESS over supervised baselines and strong proprietary models (Gemini-2.5-Pro, GPT-4o-audio), highlighting the importance of modeling paralinguistic reasoning for emotionally intelligent speech LLMs.
  • The results suggest that multi-task RL with explicit affective reasoning is a promising direction for building emotionally intelligent speech AI systems.
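The joint-optimization idea in the key points can be sketched as a combined reward over the two tasks. This is a hypothetical illustration only: the helper functions, the cue-matching heuristic, and the weight `alpha` are assumptions, not the paper's actual reward design.

```python
# Hypothetical sketch: a multi-task RL reward that jointly scores sentiment
# classification from audio and paralinguistics-aware response generation.
# The weighted-sum combination and the keyword-based response score are
# illustrative stand-ins, not the paper's method.

def sentiment_reward(predicted: str, gold: str) -> float:
    """1.0 if the predicted sentiment label matches the gold label, else 0.0."""
    return 1.0 if predicted == gold else 0.0

def response_reward(response: str, expected_cues: set[str]) -> float:
    """Fraction of expected paralinguistic cue words the response mentions
    (a toy stand-in for whatever learned reward model the paper uses)."""
    if not expected_cues:
        return 0.0
    hits = sum(1 for cue in expected_cues if cue in response.lower())
    return hits / len(expected_cues)

def multi_task_reward(predicted: str, gold: str,
                      response: str, expected_cues: set[str],
                      alpha: float = 0.5) -> float:
    """Weighted sum of the two task rewards; alpha is a hypothetical weight."""
    return (alpha * sentiment_reward(predicted, gold)
            + (1 - alpha) * response_reward(response, expected_cues))

print(multi_task_reward("sad", "sad",
                        "You sound upset and tired.", {"upset", "tired"}))
# → 1.0
```

A single scalar like this lets a standard policy-gradient RL loop optimize both tasks at once, which is the shape of the multi-task setup the summary describes.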

Abstract

Speech large language models (LLMs) observe paralinguistic cues such as prosody, emotion, and non-verbal sounds, which are crucial for intent understanding. However, leveraging these cues faces challenges: limited training data, annotation difficulty, and models exploiting lexical shortcuts over paralinguistic signals. We propose multi-task reinforcement learning (RL) with chain-of-thought prompting that elicits explicit affective reasoning. To address data scarcity, we introduce a paralinguistics-aware speech LLM (PALLM) that jointly optimizes sentiment classification from audio and paralinguistics-aware response generation via a two-stage pipeline. Experiments demonstrate that our approach improves paralinguistic understanding over both supervised baselines and strong proprietary models (Gemini-2.5-Pro, GPT-4o-audio) by 8-12% on Expresso, IEMOCAP, and RAVDESS. The results show that modeling paralinguistic reasoning with multi-task RL is crucial for building emotionally intelligent speech LLMs.
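The abstract's chain-of-thought prompting for "explicit affective reasoning" can be illustrated with a prompt template. The wording below is entirely an assumption for illustration; the summary does not give the paper's actual prompt.

```python
# Illustrative only: a chain-of-thought prompt format that asks a speech LLM
# to reason about paralinguistic cues before producing its answers. The exact
# prompt used in the paper is not specified in this summary.

def build_cot_prompt(transcript: str) -> str:
    """Build a hypothetical CoT prompt around a speech clip's transcript."""
    return (
        "You are given a speech clip with transcript:\n"
        f'  "{transcript}"\n'
        "First, reason step by step about the speaker's prosody, emotion, "
        "and any non-verbal sounds (explicit affective reasoning).\n"
        "Then output:\n"
        "  Sentiment: <label>\n"
        "  Response: <paralinguistics-aware reply>\n"
    )

prompt = build_cot_prompt("I guess it's fine.")
print(prompt)
```

Making the affective reasoning an explicit, rewardable step is what lets the RL stage discourage the lexical shortcuts the abstract warns about.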