FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation

arXiv cs.CL / 4/27/2026


Key Points

  • The paper proposes FMSD-TTS, a few-shot, multi-speaker, multi-dialect text-to-speech framework aimed at improving TTS for Tibetan’s three major dialects (Ü-Tsang, Amdo, Kham), for which parallel speech corpora are scarce.
  • FMSD-TTS uses a speaker–dialect fusion module and a Dialect-Specialized Dynamic Routing Network (DSDR-Net) to model dialect-specific acoustic/linguistic variations while preserving speaker identity.
  • Objective and subjective evaluations show that the method significantly outperforms baseline approaches in dialectal expressiveness and speaker similarity.
  • The work also validates usefulness via a speech-to-speech dialect conversion task and releases a large-scale synthetic Tibetan speech corpus plus an open-source evaluation toolkit.
  • The authors position FMSD-TTS as a practical solution for generating parallel dialectal speech using limited reference audio and explicit dialect labels, enabling faster dataset creation.
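
As a rough illustration of the fusion-and-routing idea in the points above, the Python sketch below mixes the outputs of several "expert" transforms using weights computed from a fused speaker and dialect vector. It is not the paper's DSDR-Net, whose internals are not described here: the concatenation-based fusion, the softmax gate, the toy experts, the hand-set gate weights, and every name (`fuse`, `route`, `gate_weights`) are assumptions made for illustration only.

```python
import math

# Explicit dialect labels, as named in the summary.
DIALECTS = ["Ü-Tsang", "Amdo", "Kham"]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(speaker_vec, dialect):
    """Hypothetical fusion: speaker embedding concatenated with a one-hot dialect code."""
    one_hot = [1.0 if d == dialect else 0.0 for d in DIALECTS]
    return list(speaker_vec) + one_hot


def route(hidden, fused, gate_weights, experts):
    """Mix expert outputs using softmax weights derived from the fused vector."""
    logits = [sum(w * f for w, f in zip(row, fused)) for row in gate_weights]
    weights = softmax(logits)
    outputs = [expert(hidden) for expert in experts]
    mixed = [
        sum(w * out[i] for w, out in zip(weights, outputs))
        for i in range(len(hidden))
    ]
    return mixed, weights


# Toy experts (assumption): each one just rescales the hidden vector.
experts = [
    lambda h: [1.0 * x for x in h],
    lambda h: [2.0 * x for x in h],
    lambda h: [3.0 * x for x in h],
]

# One gate row per expert; columns are 2 speaker dims followed by 3 dialect dims.
# The values are hand-set so that the dialect label dominates the routing.
gate_weights = [
    [0.0, 0.0, 5.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 5.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 5.0],
]

speaker = [0.3, -0.1]  # placeholder speaker embedding
mixed, weights = route([1.0, 2.0], fuse(speaker, "Amdo"), gate_weights, experts)
print(weights)  # the second expert receives most of the weight for "Amdo"
```

In this toy setup the dialect label alone decides which expert dominates, while the speaker vector is carried through the fused input unchanged. A learned gate would use both parts of the fused vector.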

Abstract

Tibetan is a low-resource language with minimal parallel speech corpora spanning its three major dialects (Ü-Tsang, Amdo, and Kham), limiting progress in speech modeling. To address this issue, we propose FMSD-TTS, a few-shot, multi-speaker, multi-dialect text-to-speech framework that synthesizes parallel dialectal speech from limited reference audio and explicit dialect labels. Our method features a novel speaker-dialect fusion module and a Dialect-Specialized Dynamic Routing Network (DSDR-Net) to capture fine-grained acoustic and linguistic variations across dialects while preserving speaker identity. Extensive objective and subjective evaluations demonstrate that FMSD-TTS significantly outperforms baselines in both dialectal expressiveness and speaker similarity. We further validate the quality and utility of the synthesized speech through a challenging speech-to-speech dialect conversion task. Our contributions include: (1) a novel few-shot TTS system tailored for Tibetan multi-dialect speech synthesis, (2) the public release of a large-scale synthetic Tibetan speech corpus generated by FMSD-TTS, and (3) an open-source evaluation toolkit for standardized assessment of speaker similarity, dialect consistency, and audio quality.
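
The abstract does not say how the evaluation toolkit computes speaker similarity. One common approach, shown here as a minimal sketch and not as the toolkit's actual method, is cosine similarity between a speaker embedding of the reference audio and one of the synthesized audio. The embedding extractor is out of scope, so the vectors below are placeholders, and the function names `cosine_similarity` and `mean_speaker_similarity` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def mean_speaker_similarity(pairs):
    """Average similarity over (reference_embedding, synthesized_embedding) pairs."""
    scores = [cosine_similarity(ref, syn) for ref, syn in pairs]
    return sum(scores) / len(scores)


# Placeholder embeddings (assumption): real ones would come from a speaker encoder.
pairs = [
    ([1.0, 0.0], [2.0, 0.0]),  # same direction, so similarity is 1
    ([1.0, 0.0], [0.0, 3.0]),  # orthogonal, so similarity is 0
]
print(mean_speaker_similarity(pairs))
```

Because cosine similarity ignores vector magnitude, the score reflects the direction of the embeddings only; values closer to 1 indicate that the synthesized voice sits nearer the reference speaker in embedding space.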