Sommelier: Scalable Open Multi-turn Audio Pre-processing for Full-duplex Speech Language Models

arXiv cs.AI / 3/30/2026


Key Points

  • The paper argues that the move from text LLMs to Speech Language Models (SLMs) creates strong demand for real-time full-duplex conversational systems.
  • It identifies a key bottleneck: high-quality multi-speaker, multi-turn dialogue data is scarce, while existing large resources are often single-speaker or too limited in scale.
  • It highlights how overlapping speech and back-channeling cause standard pipelines to fail, leading to diarization errors and ASR hallucinations.
  • It proposes an open-source, scalable multi-turn audio pre-processing pipeline intended to better prepare data for full-duplex speech language model training and evaluation.
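The abstract does not detail the pipeline's internals, but the failure mode it names (short back-channel utterances overlapping the main speaker, which standard diarization-then-ASR pipelines mishandle) can be illustrated with a small, self-contained sketch. The segment layout, thresholds, and the `tag_backchannels` heuristic below are illustrative assumptions, not the paper's actual method.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str
    start: float  # seconds
    end: float

def find_overlaps(segments):
    """Return (a, b, overlap_start, overlap_end) for every pair of
    segments from different speakers whose time spans intersect."""
    out = []
    segs = sorted(segments, key=lambda s: s.start)
    for i, a in enumerate(segs):
        for b in segs[i + 1:]:
            if b.start >= a.end:
                break  # sorted by start: no later segment can overlap a
            if a.speaker != b.speaker:
                out.append((a, b, b.start, min(a.end, b.end)))
    return out

def tag_backchannels(segments, max_dur=1.0):
    """Hypothetical heuristic: a short utterance fully contained inside
    another speaker's turn is tagged as a back-channel, so it can be kept
    as a separate event instead of corrupting the host turn's transcript."""
    tags = {}
    for a, b, _, _ in find_overlaps(segments):
        for cand, host in ((a, b), (b, a)):
            dur = cand.end - cand.start
            if dur <= max_dur and host.start <= cand.start and cand.end <= host.end:
                tags[id(cand)] = "backchannel"
    return [(s, tags.get(id(s), "turn")) for s in segments]

# Speaker B interjects "mm-hm" while A holds the floor:
dialogue = [
    Segment("A", 0.0, 5.0),
    Segment("B", 2.0, 2.5),   # short, fully inside A's turn
    Segment("B", 6.0, 9.0),   # a real turn
]
```

A pipeline that simply resolves overlaps to one speaker per frame would either drop B's interjection or split A's turn in two; tagging overlapped short segments explicitly preserves the full-duplex structure the paper argues training data needs.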

Abstract

As the paradigm of AI shifts from text-based LLMs to Speech Language Models (SLMs), there is a growing demand for full-duplex systems capable of real-time, natural human-computer interaction. However, the development of such models is constrained by the scarcity of high-quality, multi-speaker conversational data, as existing large-scale resources are predominantly single-speaker or limited in volume. Addressing the complex dynamics of natural dialogue, such as overlapping speech and back-channeling, remains a challenge, with standard processing pipelines suffering from diarization errors and ASR hallucinations. To bridge this gap, we present a robust and scalable open-source data processing pipeline designed for full-duplex models.
