TaigiSpeech: A Low-Resource Real-World Speech Intent Dataset and Preliminary Results with Scalable Data Mining In-the-Wild

arXiv cs.CL / 3/24/2026


Key Points

  • The paper introduces TaigiSpeech, a low-resource, real-world Taiwanese Taigi (Taiwanese Hokkien/Southern Min) speech intent dataset collected from 21 older adult speakers with about 3,000 utterances.
  • The dataset targets practical intent-detection use cases such as healthcare and home assistant scenarios, focusing on a primarily spoken and underrepresented language.
  • To scale beyond limited labeled data, the authors explore two data mining strategies at two levels of supervision: keyword-match mining with LLM pseudo-labeling via an intermediate language, and an audio-visual framework that uses multimodal cues with minimal textual supervision.
  • The dataset will be released under the CC BY 4.0 license to support broader research and adoption for low-resource and unwritten spoken languages.
  • Preliminary results suggest that scalable data-mining pipelines combining weak supervision and multimodal cues can help build usable intent datasets for languages with scarce resources.

Abstract

Speech technologies have advanced rapidly and serve diverse populations worldwide. However, many languages remain underrepresented due to limited resources. In this paper, we introduce TaigiSpeech, a real-world speech intent dataset in Taiwanese Taigi (also known as Taiwanese Hokkien/Southern Min), which is a low-resource and primarily spoken language. The dataset is collected from older adults, comprising 21 speakers with a total of 3k utterances. It is designed for practical intent detection scenarios, including healthcare and home assistant applications. To address the scarcity of labeled data, we explore two data mining strategies with two levels of supervision: keyword-match data mining with LLM pseudo-labeling via an intermediate language, and an audio-visual framework that leverages multimodal cues with minimal textual supervision. This design enables scalable dataset construction for low-resource and unwritten spoken languages. TaigiSpeech will be released under the CC BY 4.0 license to facilitate broad adoption and research on low-resource and unwritten languages. The project website and the dataset are available at https://kwchang.org/taigispeech.
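
The keyword-match strategy with pseudo-labeling described above can be illustrated with a minimal Python sketch. This is not the authors' pipeline. It assumes each audio segment comes with text in an intermediate language (shown here in English for readability), that candidate intents are proposed by substring matching against a keyword table, and that a second pass confirms or rejects each candidate. The names Segment, INTENT_KEYWORDS, keyword_candidates, mine, and toy_labeler are all hypothetical, and toy_labeler is only a stand-in for an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical intent -> keyword table; the intents and phrases are
# illustrative assumptions, not taken from the paper.
INTENT_KEYWORDS: Dict[str, List[str]] = {
    "call_for_help": ["help me", "emergency"],
    "lights_on": ["turn on the light"],
}


@dataclass
class Segment:
    audio_id: str  # identifier of the audio clip
    text: str      # transcript or subtitle in the intermediate language


def keyword_candidates(
    segments: Iterable[Segment],
    keywords: Dict[str, List[str]] = INTENT_KEYWORDS,
) -> Iterator[Tuple[Segment, str]]:
    """Yield (segment, candidate_intent) for the first intent whose keyword matches."""
    for seg in segments:
        lowered = seg.text.lower()
        for intent, phrases in keywords.items():
            if any(phrase in lowered for phrase in phrases):
                yield seg, intent
                break  # at most one candidate intent per segment


def mine(
    segments: Iterable[Segment],
    pseudo_label: Callable[[str, str], bool],
) -> List[Tuple[str, str]]:
    """Keep only keyword candidates that the pseudo-labeler confirms."""
    return [
        (seg.audio_id, intent)
        for seg, intent in keyword_candidates(segments)
        if pseudo_label(seg.text, intent)
    ]


def toy_labeler(text: str, intent: str) -> bool:
    """Stand-in for an LLM judgment (assumption): reject negated requests."""
    return "don't" not in text.lower()


segments = [
    Segment("a1", "Please help me, I fell"),
    Segment("a2", "Don't turn on the light"),
    Segment("a3", "Nice weather today"),
]
mined = mine(segments, toy_labeler)
```

In this toy run, only segment a1 survives: a2 matches a keyword but is rejected by the labeler, and a3 matches nothing. Separating cheap keyword matching from the more expensive confirmation step is one plausible way to keep such a pipeline scalable, since the second pass only sees the matched candidates.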