Breeze Taigi: Benchmarks and Models for Taiwanese Hokkien Speech Recognition and Synthesis

arXiv cs.AI / March 23, 2026

Key Points

  • Breeze Taigi introduces a standardized benchmark framework for Taigi speech recognition and synthesis, enabling reproducible cross-system comparison via 30 parallel Mandarin-Taigi audio pairs.
  • Evaluation is standardized around Character Error Rate (CER), with normalization procedures that make cross-system comparisons fair (a minimal sketch of such a pipeline follows this list).
  • To demonstrate the benchmark's utility, the authors fine-tune Whisper on roughly 10,000 hours of synthetic Taigi speech, reaching a 30.13% average CER on the benchmark and outperforming existing commercial and research systems.
  • By providing open baseline models and reference implementations, the work offers a replicable framework with methodologies applicable to other low-resource languages and contexts.

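For context, here is a minimal Python sketch of what a CER evaluation with text normalization might look like. The normalization rules shown (NFKC folding, then stripping punctuation, symbols, and whitespace) are illustrative assumptions; the paper defines its own procedures, which are not reproduced in this summary.

```python
import unicodedata

def normalize(text: str) -> str:
    """Hypothetical normalizer: NFKC-fold, then drop punctuation,
    symbols, and whitespace so surface formatting differences do not
    count as recognition errors. The paper's actual rules may differ."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith(("P", "S", "Z"))
    )

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein distance over characters,
    divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    # Standard dynamic-programming edit distance, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution
            ))
        prev = curr
    return prev[-1] / max(len(ref), 1)

# Averaging per-utterance cer() over the 30 benchmark pairs would
# yield the kind of "average CER" figure reported in the paper.
```

One caveat worth noting: "average CER" can mean either the mean of per-utterance scores or total edits divided by total reference characters, and the two can differ; ambiguities like this are precisely what a standardized evaluation protocol removes.
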
Abstract

Taiwanese Hokkien (Taigi) presents unique opportunities for advancing speech technology methodologies that can generalize to diverse linguistic contexts. We introduce Breeze Taigi, a comprehensive framework centered on standardized benchmarks for evaluating Taigi speech recognition and synthesis systems. Our primary contribution is a reproducible evaluation methodology that leverages parallel Taiwanese Mandarin resources. We provide 30 carefully curated Mandarin-Taigi audio pairs from Taiwan's Executive Yuan public service announcements, with normalized ground-truth transcriptions. We establish Character Error Rate (CER) as the standard metric and implement normalization procedures to enable fair cross-system comparisons. To demonstrate the benchmark's utility and provide reference implementations, we develop speech recognition and synthesis models through a methodology that combines existing Taiwanese Mandarin resources with large-scale synthetic data generation. In particular, we fine-tune a Whisper model on approximately 10,000 hours of synthetic Taigi speech data. Our ASR model achieves a 30.13% average CER on the benchmark, outperforming existing commercial and research systems. By providing standardized evaluation protocols, diverse training datasets, and open baseline models, we offer a replicable framework with methodologies applicable to various linguistic contexts.
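
This summary does not include the authors' training recipe, so the sketch below shows a generic Whisper fine-tuning setup using the Hugging Face transformers library, roughly the shape such an experiment takes. The model checkpoint, output path, hyperparameters, and the train_ds dataset handle are all illustrative assumptions, not the paper's actual configuration.

```python
# Generic Whisper fine-tuning sketch (Hugging Face transformers).
# Checkpoint, hyperparameters, and dataset are illustrative only.
import torch
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

CHECKPOINT = "openai/whisper-small"  # assumption; paper's model size not given here
processor = WhisperProcessor.from_pretrained(CHECKPOINT)
model = WhisperForConditionalGeneration.from_pretrained(CHECKPOINT)

def preprocess(example):
    # 16 kHz waveform -> log-mel input features; transcript -> label ids.
    audio = example["audio"]
    example["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    example["labels"] = processor.tokenizer(example["text"]).input_ids
    return example

def collate(features):
    # Pad features and labels to uniform length; mask label padding
    # with -100 so the cross-entropy loss ignores it.
    batch = processor.feature_extractor.pad(
        [{"input_features": f["input_features"]} for f in features],
        return_tensors="pt",
    )
    labels = processor.tokenizer.pad(
        [{"input_ids": f["labels"]} for f in features],
        return_tensors="pt",
    )
    batch["labels"] = labels["input_ids"].masked_fill(
        labels["attention_mask"].eq(0), -100
    )
    return batch

def finetune(train_ds):
    # train_ds: a (hypothetical) dataset of synthetic Taigi utterances,
    # already mapped through preprocess(). Hyperparameters are illustrative.
    args = Seq2SeqTrainingArguments(
        output_dir="whisper-taigi",
        per_device_train_batch_size=16,
        learning_rate=1e-5,
        warmup_steps=500,
        max_steps=20_000,
        fp16=torch.cuda.is_available(),
    )
    trainer = Seq2SeqTrainer(
        model=model, args=args, train_dataset=train_ds, data_collator=collate
    )
    trainer.train()
```

The distinctive part of the paper's recipe is the data rather than the training loop: roughly 10,000 hours of synthetic Taigi speech stand in for transcribed recordings that do not exist at that scale for a low-resource language.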