FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation

arXiv cs.CL / 4/27/2026

Key Points

The paper proposes FMSD-TTS, a few-shot, multi-speaker, multi-dialect text-to-speech framework aimed at improving TTS for Tibetan's three major dialects (Ü-Tsang, Amdo, Kham), for which parallel speech corpora are scarce.
FMSD-TTS uses a speaker–dialect fusion module and a Dialect-Specialized Dynamic Routing Network (DSDR-Net) to model dialect-specific acoustic/linguistic variations while preserving speaker identity.
In both objective and subjective evaluations, the method significantly outperforms baseline approaches in dialectal expressiveness and speaker similarity.
The work also validates usefulness via a speech-to-speech dialect conversion task and releases a large-scale synthetic Tibetan speech corpus plus an open-source evaluation toolkit.
The authors position FMSD-TTS as a practical solution for generating parallel dialectal speech using limited reference audio and explicit dialect labels, enabling faster dataset creation.
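
To make the idea of dialect-conditioned routing concrete, here is a minimal sketch in plain Python. The summary does not describe the internals of the speaker–dialect fusion module or DSDR-Net, so everything below is an assumption for illustration: `fuse`, `DialectRouter`, the one-hot dialect code, the linear experts, and all dimensions are hypothetical and are not taken from the paper. The sketch only shows the general pattern of a gate, driven by a fused speaker and dialect vector, that mixes several experts.

```python
import math
import random

DIALECTS = ("Ü-Tsang", "Amdo", "Kham")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) nested-list matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def fuse(speaker_embedding, dialect):
    """Hypothetical fusion: append a one-hot dialect code to the speaker embedding."""
    one_hot = [1.0 if name == dialect else 0.0 for name in DIALECTS]
    return list(speaker_embedding) + one_hot


class DialectRouter:
    """Toy dialect-conditioned mixture of linear experts (NOT the paper's DSDR-Net)."""

    def __init__(self, hidden_dim, cond_dim, num_experts=3, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        # Each expert is a random linear map from hidden_dim to hidden_dim.
        self.experts = [rand_matrix(hidden_dim, hidden_dim) for _ in range(num_experts)]
        # The gate scores every expert from the fused speaker-dialect vector.
        self.gate = rand_matrix(num_experts, cond_dim)

    def weights(self, condition):
        """Return one mixing weight per expert; the weights sum to 1."""
        return softmax(matvec(self.gate, condition))

    def forward(self, hidden, condition):
        """Blend the expert outputs using the gate weights."""
        weights = self.weights(condition)
        outputs = [matvec(expert, hidden) for expert in self.experts]
        return [
            sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(len(hidden))
        ]


speaker = [0.2, -0.1, 0.4, 0.3]  # made-up 4-dim speaker embedding
hidden = [1.0, 0.5, -0.5]        # made-up 3-dim text-encoder frame
router = DialectRouter(hidden_dim=3, cond_dim=len(speaker) + len(DIALECTS))
for dialect in DIALECTS:
    condition = fuse(speaker, dialect)
    print(dialect, [round(w, 3) for w in router.weights(condition)])
```

With the speaker embedding held fixed, changing only the dialect label changes the gate weights and therefore the blended output, which is the behavior a parallel-dialect generator would need. The weights here are random rather than trained, so the printed values carry no meaning beyond showing that the routing differs per dialect.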

Abstract

Tibetan is a low-resource language with minimal parallel speech corpora spanning its three major dialects (Ü-Tsang, Amdo, and Kham), limiting progress in speech modeling. To address this issue, we propose FMSD-TTS, a few-shot, multi-speaker, multi-dialect text-to-speech framework that synthesizes parallel dialectal speech from limited reference audio and explicit dialect labels. Our method features a novel speaker-dialect fusion module and a Dialect-Specialized Dynamic Routing Network (DSDR-Net) to capture fine-grained acoustic and linguistic variations across dialects while preserving speaker identity. Extensive objective and subjective evaluations demonstrate that FMSD-TTS significantly outperforms baselines in both dialectal expressiveness and speaker similarity. We further validate the quality and utility of the synthesized speech through a challenging speech-to-speech dialect conversion task. Our contributions include: (1) a novel few-shot TTS system tailored for Tibetan multi-dialect speech synthesis, (2) the public release of a large-scale synthetic Tibetan speech corpus generated by FMSD-TTS, and (3) an open-source evaluation toolkit for standardized assessment of speaker similarity, dialect consistency, and audio quality.
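
The abstract names speaker similarity and dialect consistency as evaluation targets but does not say how the toolkit computes them. As a rough illustration only, the sketch below uses two common choices: cosine similarity between speaker embeddings, and the fraction of synthesized utterances whose predicted dialect matches the requested label. The function names, the embeddings, and the labels are all made up for this example and are not the toolkit's API; in practice the embeddings would come from a speaker encoder and the predictions from a dialect classifier.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero embeddings."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def dialect_consistency(requested, predicted):
    """Fraction of utterances whose predicted dialect equals the requested one."""
    if not requested or len(requested) != len(predicted):
        raise ValueError("label lists must be non-empty and equally long")
    matches = sum(r == p for r, p in zip(requested, predicted))
    return matches / len(requested)


# Made-up values standing in for speaker-encoder and classifier outputs.
reference_embedding = [0.9, 0.1, 0.3]
synthesized_embedding = [0.8, 0.2, 0.4]
print(round(cosine_similarity(reference_embedding, synthesized_embedding), 3))

requested_labels = ["Amdo", "Kham", "Amdo", "Ü-Tsang"]
predicted_labels = ["Amdo", "Kham", "Kham", "Ü-Tsang"]
print(dialect_consistency(requested_labels, predicted_labels))
```

A similarity near 1 suggests the synthesized voice stays close to the reference speaker, and a consistency near 1 suggests the requested dialect is being realized. Both numbers depend heavily on the quality of the encoder and classifier used to produce the inputs.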