TaigiSpeech: A Low-Resource Real-World Speech Intent Dataset and Preliminary Results with Scalable Data Mining In-the-Wild

arXiv cs.CL / 3/24/2026

Key Points

The paper introduces TaigiSpeech, a low-resource, real-world speech intent dataset in Taiwanese Taigi (Taiwanese Hokkien/Southern Min), comprising about 3,000 utterances from 21 older adult speakers.
The dataset targets practical intent-detection use cases such as healthcare and home assistant scenarios, focusing on a primarily spoken and underrepresented language.
To scale beyond limited labeled data, the authors explore two data mining strategies with different levels of supervision: keyword-match mining with LLM pseudo-labeling via an intermediate language, and an audio-visual framework that uses multimodal cues with minimal textual supervision.
TaigiSpeech will be released under the CC BY 4.0 license to support broad adoption and research on low-resource and unwritten languages.
Preliminary results suggest that scalable data-mining pipelines combining weak supervision and multimodal cues can help build usable intent datasets for languages with scarce resources.

Abstract

Speech technologies have advanced rapidly and serve diverse populations worldwide. However, many languages remain underrepresented due to limited resources. In this paper, we introduce TaigiSpeech, a real-world speech intent dataset in Taiwanese Taigi (aka Taiwanese Hokkien/Southern Min), which is a low-resource and primarily spoken language. The dataset is collected from older adults, comprising 21 speakers with a total of 3k utterances. It is designed for practical intent detection scenarios, including healthcare and home assistant applications. To address the scarcity of labeled data, we explore two data mining strategies with two levels of supervision: keyword match data mining with LLM pseudo labeling via an intermediate language and an audio-visual framework that leverages multimodal cues with minimal textual supervision. This design enables scalable dataset construction for low-resource and unwritten spoken languages. TaigiSpeech will be released under the CC BY 4.0 license to facilitate broad adoption and research on low-resource and unwritten languages. The project website and the dataset can be found on https://kwchang.org/taigispeech.
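
To make the first strategy concrete, here is a minimal sketch of keyword-match mining followed by pseudo-label confirmation. It is an illustration under stated assumptions, not the authors' pipeline: the intent names, the keyword lists, the use of English as the intermediate language, and the `judge` callback (a stand-in for an LLM call) are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical intents and keywords in the intermediate language
# (English is assumed here; the paper's actual lists are not given).
INTENT_KEYWORDS = {
    "call_for_help": ["help", "emergency"],
    "lights_on": ["turn on the light", "lights on"],
}


@dataclass
class Candidate:
    clip_id: str
    transcript: str  # intermediate-language text attached to the audio clip
    intent: str      # intent proposed by the keyword match


def keyword_match(clips):
    """Return clips whose intermediate-language text contains an intent keyword."""
    candidates = []
    for clip_id, text in clips:
        lowered = text.lower()
        for intent, keywords in INTENT_KEYWORDS.items():
            if any(kw in lowered for kw in keywords):
                candidates.append(Candidate(clip_id, text, intent))
                break  # keep only the first matching intent per clip
    return candidates


def mine(clips, judge):
    """Keep candidates whose proposed intent the judge confirms.

    `judge(transcript, intent) -> bool` stands in for an LLM pseudo-labeling
    call; any real prompt or model is outside this sketch.
    """
    labeled = []
    for cand in keyword_match(clips):
        if judge(cand.transcript, cand.intent):
            labeled.append((cand.clip_id, cand.intent))
    return labeled


if __name__ == "__main__":
    clips = [
        ("a", "Please help me, I fell"),
        ("b", "The weather is nice"),
        ("c", "Can you turn on the light?"),
        ("d", "This video will help you cook"),
    ]
    # Toy judge: rejects the cooking clip, which matched "help" by accident.
    toy_judge = lambda text, intent: "cook" not in text
    print(mine(clips, toy_judge))
```

The split into a cheap keyword filter and a separate confirmation step mirrors the idea of weak supervision: the filter proposes many candidates, and the judge removes accidental matches such as clip "d" above.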
