Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models

arXiv cs.AI / 4/17/2026

📰 News · Models & Research

Key Points

  • The paper addresses a core challenge in full-duplex spoken dialogue models (SDMs): improving interaction quality via reinforcement learning (RL) when existing automated metrics are unreliable reward proxies.
  • It proposes a Dual-Axis Generative Reward Model that evaluates interactions along two separate axes, semantic quality and turn-taking/interaction timing, while also producing a single overall score (one possible output schema is sketched after this list).
  • The approach uses a detailed taxonomy and an annotated dataset to capture complex interaction dynamics more faithfully than timing- or behavior-only measures.
  • Experiments show state-of-the-art performance on interaction-quality assessment across both synthetic and real-world spoken dialogue datasets, indicating the model can supply stronger reward signals for online RL.
  • The resulting dual evaluation outputs are positioned as diagnostic feedback that can directly guide and stabilize the learning of SDMs during reinforcement learning training.
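
The announcement does not specify the judge's output format, so the snippet below is a minimal Python sketch of how a dual-axis verdict could be represented and parsed. The `DualAxisReward` type, the 0-10 scale, and the `Semantic:`/`Turn-taking:`/`Overall:` trailer lines are illustrative assumptions, not the authors' actual interface.

```python
from dataclasses import dataclass
import re


@dataclass
class DualAxisReward:
    """Scores a generative judge might emit for one dialogue segment (0-10 assumed)."""
    semantic: float     # quality of what was said
    turn_taking: float  # timing of when it was said: pauses, overlaps, barge-ins
    overall: float      # the single aggregate score the paper also reports


def parse_judge_output(text: str) -> DualAxisReward:
    """Pull the three scores out of a judge's free-form critique.

    Assumes the judge was prompted to end with lines such as
    'Semantic: 7' / 'Turn-taking: 4' / 'Overall: 5'.
    """
    def score(label: str) -> float:
        m = re.search(rf"{label}\s*[:=]\s*(\d+(?:\.\d+)?)", text, re.IGNORECASE)
        if m is None:
            raise ValueError(f"missing '{label}' score in judge output")
        return float(m.group(1))

    return DualAxisReward(score("Semantic"), score("Turn-taking"), score("Overall"))


critique = "On-topic reply, but it cuts the user off.\nSemantic: 7\nTurn-taking: 4\nOverall: 5"
print(parse_judge_output(critique))
# DualAxisReward(semantic=7.0, turn_taking=4.0, overall=5.0)
```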

Abstract

Achieving seamless, human-like interaction remains a key challenge for full-duplex spoken dialogue models (SDMs). Reinforcement learning (RL) has substantially enhanced text- and vision-language models, and well-designed reward signals are crucial to its performance, which makes RL a promising strategy for addressing this challenge in SDMs. However, a fundamental barrier persists: prevailing automated metrics for assessing interaction quality rely on superficial proxies, such as behavioral statistics or timing-prediction accuracy, and thus fail to provide reliable reward signals for RL, while human evaluations, despite their richness, remain costly, inconsistent, and difficult to scale. We tackle this barrier by proposing a Dual-Axis Generative Reward Model that is trained to understand complex interaction dynamics using a detailed taxonomy and an annotated dataset; it produces a single overall score and, crucially, separate evaluations for semantic quality and interaction timing. These dual outputs furnish precise diagnostic feedback for SDMs and deliver a dependable, instructive reward signal suitable for online reinforcement learning. Our model achieves state-of-the-art performance on interaction-quality assessment across a wide spectrum of datasets, spanning synthetic dialogues and complex real-world interactions.
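
To make "reward signal suitable for online reinforcement learning" concrete: a policy-gradient update ultimately needs a scalar, so the two axis scores would have to be combined at training time. The weighted sum below is a hedged illustration of one such scalarization; the abstract does not say how, or whether, the axes are mixed during RL.

```python
def scalar_reward(semantic: float, turn_taking: float,
                  w_semantic: float = 0.5, w_timing: float = 0.5) -> float:
    """Illustrative scalarization of the two axis scores (each assumed in [0, 10]).

    A weighted sum is the simplest choice; a real system might instead gate on
    the weaker axis so that fluent wording cannot mask broken turn-taking.
    """
    assert abs(w_semantic + w_timing - 1.0) < 1e-9, "weights should sum to 1"
    return w_semantic * semantic + w_timing * turn_taking


# A fluent response delivered with poor barge-in timing still scores mid-range,
# which is exactly the failure mode the separate axes are meant to expose.
print(scalar_reward(semantic=8.0, turn_taking=3.0))  # 5.5
```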