WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training

arXiv cs.AI / 4/17/2026

📰 News · Models & Research

Key Points

  • The paper argues that end-to-end spoken dialogue models should be more expressive than cascaded systems, but many open-source models still fall short in intelligence and expressiveness.
  • It identifies why directly applying preference optimization or RL to spoken dialogue is difficult, focusing on issues in reward modeling and rollout sampling.
  • The authors propose “WavAlign,” a modality-aware adaptive post-training approach that makes RL practical for spoken dialogue by separating semantic preference updates from acoustic refinement.
  • WavAlign constrains preference updates to the semantic channel, uses explicit anchoring to refine acoustic behavior, and dynamically mixes preference updates based on rollout statistics to avoid unreliable gradients.
  • Experiments across multiple spoken dialogue benchmarks and architectures show consistent gains in both semantic quality and speech expressiveness.

Abstract

End-to-end spoken dialogue models have garnered significant attention because they offer a higher potential ceiling in expressiveness and perceptual ability than cascaded systems. However, the intelligence and expressiveness of current open-source spoken dialogue models often remain below expectations. Motivated by the success of online reinforcement learning (RL) in other domains, one might attempt to directly apply preference optimization to spoken dialogue models, yet this transfer is non-trivial. We analyze these obstacles from the perspectives of reward modeling and rollout sampling, focusing on how sparse preference supervision interacts with dense speech generation under shared-parameter updates. Based on the analysis, we propose a modality-aware adaptive post-training recipe that makes RL practical for spoken dialogue: it constrains preference updates to the semantic channel and improves acoustic behavior via explicit anchoring, while dynamically regulating their mixture from rollout statistics to avoid unreliable preference gradients. We evaluate the method across multiple spoken dialogue benchmarks and representative architectures, and observe consistent improvements in semantic quality and speech expressiveness.
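The abstract's recipe — a preference loss restricted to the semantic channel, an explicit anchoring term for acoustics, and a mixture weight driven by rollout statistics — can be illustrated with a minimal sketch. Everything below is hypothetical: the paper does not specify its losses or constants, so this assumes a DPO-style logistic preference loss over semantic log-probabilities, a simple MSE anchor toward a frozen reference model's acoustic outputs, and a reward-variance threshold as the gating statistic.

```python
import math

def adaptive_hybrid_loss(sem_logp_chosen, sem_logp_rejected,
                         ac_pred, ac_anchor, rollout_rewards,
                         beta=0.1, var_threshold=0.05):
    """Hypothetical sketch of a modality-aware hybrid objective."""
    # Semantic channel: DPO-style preference loss computed only on
    # semantic log-probs, so acoustic behavior sees no preference gradient.
    margin = beta * (sem_logp_chosen - sem_logp_rejected)
    pref_loss = math.log1p(math.exp(-margin))  # = -log(sigmoid(margin))

    # Acoustic channel: explicit anchoring, here an MSE pull toward
    # a frozen reference model's acoustic outputs.
    anchor_loss = sum((p - a) ** 2
                      for p, a in zip(ac_pred, ac_anchor)) / len(ac_pred)

    # Rollout-statistics gate: if rewards across sampled rollouts barely
    # vary, the preference signal is unreliable, so down-weight it.
    mean_r = sum(rollout_rewards) / len(rollout_rewards)
    var_r = sum((r - mean_r) ** 2 for r in rollout_rewards) / len(rollout_rewards)
    alpha = 1.0 if var_r > var_threshold else 0.2

    return alpha * pref_loss + (1.0 - alpha) * anchor_loss
```

The key structural point is the gate: when rollout rewards are nearly constant, preference gradients reduce to noise, and the mixture shifts weight toward the stable anchoring term rather than updating on an unreliable signal.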