MOSS-TTSD: Text to Spoken Dialogue Generation

arXiv cs.CL / 3/23/2026

📰 News · Models & Research

Key Points

  • The paper introduces MOSS-TTSD, a spoken dialogue synthesis model for expressive, multi-party conversational speech across languages, addressing long-context modeling and cross-turn coherence.
  • It enables long-form single-pass synthesis of up to 60 minutes, supports up to five speakers, and includes zero-shot voice cloning from a short reference clip.
  • It also proposes TTSD-eval, an objective evaluation framework based on forced alignment to measure speaker attribution and similarity without relying on diarization tools.
  • The model is shown to surpass strong open-source and proprietary baselines in both objective and subjective evaluations of dialogue synthesis across languages.
  • With applications in podcasts, dynamic commentary, and entertainment, MOSS-TTSD marks a notable advance in long-form, multi-speaker voice generation.
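The input described above is a dialogue script with explicit speaker tags. A minimal sketch of assembling such a script is below; note that the tag syntax (`[S1]`, `[S2]`, …) is an illustrative assumption, not the model's documented format.

```python
# Hypothetical helper: build a speaker-tagged dialogue script from turns.
# The [S<n>] tag convention here is an assumption for illustration only.
def format_dialogue_script(turns):
    """turns: list of (speaker_index, text) pairs -> single tagged script."""
    return "\n".join(f"[S{idx}] {text}" for idx, text in turns)

script = format_dialogue_script([
    (1, "Welcome back to the show."),
    (2, "Thanks, great to be here."),
    (1, "Let's dive in."),
])
```

A model supporting up to five speakers would accept tags `[S1]` through `[S5]` under this convention.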

Abstract

Spoken dialogue generation is crucial for applications like podcasts, dynamic commentary, and entertainment content, but poses significant challenges compared to single-utterance text-to-speech (TTS). Key requirements include accurate turn-taking, cross-turn acoustic consistency, and long-form stability, which current models often fail to address due to a lack of dialogue context modeling. To bridge this gap, we present MOSS-TTSD, a spoken dialogue synthesis model designed for expressive, multi-party conversational speech across multiple languages. With enhanced long-context modeling, MOSS-TTSD generates long-form spoken conversations from dialogue scripts with explicit speaker tags, supporting up to 60 minutes of single-pass synthesis, multi-party dialogue with up to 5 speakers, and zero-shot voice cloning from a short reference audio clip. The model supports various mainstream languages, including English and Chinese, and is adapted to several long-form scenarios. Additionally, to address limitations of existing evaluation methods, we propose TTSD-eval, an objective evaluation framework based on forced alignment that measures speaker attribution accuracy and speaker similarity without relying on speaker diarization tools. Both objective and subjective evaluation results show that MOSS-TTSD surpasses strong open-source and proprietary baselines in dialogue synthesis.
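The TTSD-eval idea above — scoring speaker attribution and similarity from forced-aligned turn boundaries instead of running a diarization system — can be sketched as follows. Everything here (data shapes, the cosine-similarity scoring, the `ttsd_eval_sketch` helper) is an assumption for illustration, not the paper's actual pipeline.

```python
# Minimal sketch (assumed data shapes, not the paper's actual pipeline):
# turns are segmented from the synthesized audio via forced alignment against
# the known script, so no speaker diarization is needed. Each turn is then
# attributed to the closest reference-speaker embedding.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def ttsd_eval_sketch(turn_embeddings, turn_labels, reference_embeddings):
    """turn_embeddings: one speaker embedding per forced-aligned turn.
    turn_labels: intended speaker id per turn, taken from the input script.
    reference_embeddings: dict mapping speaker id -> reference-clip embedding.
    Returns (attribution accuracy, mean similarity to the intended speaker)."""
    correct, sims = 0, []
    for emb, label in zip(turn_embeddings, turn_labels):
        scores = {spk: cosine(emb, ref) for spk, ref in reference_embeddings.items()}
        # attribute the turn to the nearest reference speaker
        correct += max(scores, key=scores.get) == label
        # similarity to the *intended* speaker's reference clip
        sims.append(scores[label])
    return correct / len(turn_labels), sum(sims) / len(sims)
```

Because the script (with its speaker tags) is known at evaluation time, forced alignment gives exact turn boundaries, sidestepping the segmentation errors a diarization tool would introduce.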