Sommelier: Scalable Open Multi-turn Audio Pre-processing for Full-duplex Speech Language Models

arXiv cs.AI / 3/30/2026


Key Points

  • The paper argues that the move from text LLMs to Speech Language Models (SLMs) creates strong demand for real-time full-duplex conversational systems.
  • It identifies a key bottleneck: high-quality multi-speaker, multi-turn dialogue data is scarce, while existing large resources are often single-speaker or too limited in scale.
  • It highlights how overlapping speech and back-channeling cause standard pipelines to fail, leading to diarization errors and ASR hallucinations.
  • It proposes an open-source, scalable multi-turn audio pre-processing pipeline intended to better prepare data for full-duplex speech language model training and evaluation.
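The abstract does not detail the pipeline's internals, but the failure mode it names (short back-channel utterances overlapping the main speaker, which standard diarization-then-ASR pipelines mishandle) can be illustrated with a small, self-contained sketch. The segment layout, thresholds, and the `tag_backchannels` heuristic below are illustrative assumptions, not the paper's actual method.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str
    start: float  # seconds
    end: float

def find_overlaps(segments):
    """Return (a, b, overlap_start, overlap_end) for every pair of
    segments from different speakers whose time spans intersect."""
    out = []
    segs = sorted(segments, key=lambda s: s.start)
    for i, a in enumerate(segs):
        for b in segs[i + 1:]:
            if b.start >= a.end:
                break  # sorted by start: no later segment can overlap a
            if a.speaker != b.speaker:
                out.append((a, b, b.start, min(a.end, b.end)))
    return out

def tag_backchannels(segments, max_dur=1.0):
    """Hypothetical heuristic: a short utterance fully contained inside
    another speaker's turn is tagged as a back-channel, so it can be kept
    as a separate event instead of corrupting the host turn's transcript."""
    tags = {}
    for a, b, _, _ in find_overlaps(segments):
        for cand, host in ((a, b), (b, a)):
            dur = cand.end - cand.start
            if dur <= max_dur and host.start <= cand.start and cand.end <= host.end:
                tags[id(cand)] = "backchannel"
    return [(s, tags.get(id(s), "turn")) for s in segments]

# Speaker B interjects "mm-hm" while A holds the floor:
dialogue = [
    Segment("A", 0.0, 5.0),
    Segment("B", 2.0, 2.5),   # short, fully inside A's turn
    Segment("B", 6.0, 9.0),   # a real turn
]
```

A pipeline that simply resolves overlaps to one speaker per frame would either drop B's interjection or split A's turn in two; tagging overlapped short segments explicitly preserves the full-duplex structure the paper argues training data needs.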

Abstract

As the paradigm of AI shifts from text-based LLMs to Speech Language Models (SLMs), there is a growing demand for full-duplex systems capable of real-time, natural human-computer interaction. However, the development of such models is constrained by the scarcity of high-quality, multi-speaker conversational data, as existing large-scale resources are predominantly single-speaker or limited in volume. Addressing the complex dynamics of natural dialogue, such as overlapping speech and back-channeling, remains a challenge, with standard processing pipelines suffering from diarization errors and ASR hallucinations. To bridge this gap, we present a robust and scalable open-source data processing pipeline designed for full-duplex models.
