TaigiSpeech: A Low-Resource Real-World Speech Intent Dataset and Preliminary Results with Scalable Data Mining In-the-Wild
arXiv cs.CL / 3/24/2026
Key Points
- The paper introduces TaigiSpeech, a low-resource, real-world speech intent dataset for Taiwanese Taigi (Taiwanese Hokkien/Southern Min), comprising about 3,000 utterances collected from 21 older adult speakers.
- The dataset targets practical intent-detection use cases such as healthcare and home assistant scenarios, focusing on a primarily spoken and underrepresented language.
- To scale beyond limited labeled data, the authors evaluate two data-mining strategies: keyword-match mining with LLM pseudo-labeling through an intermediate language, and an audio-visual multimodal approach that requires minimal textual supervision.
- The project is planned for release under a CC BY 4.0 license to support broader research and adoption for low-resource and unwritten spoken languages.
- Preliminary results suggest that scalable data-mining pipelines combining weak supervision and multimodal cues can help build usable intent datasets for languages with scarce resources.
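The keyword-match mining idea from the key points can be illustrated with a minimal sketch. This is not the authors' pipeline: the intent names, the keyword lists, the `Segment` type, and the `mine_candidates` function are all hypothetical, and the LLM pseudo-labeling step is left out entirely. The sketch assumes each audio segment already has a transcript or subtitle in an intermediate language (English is used here purely for illustration) and keeps only segments whose text matches exactly one intent.

```python
from dataclasses import dataclass

# Hypothetical intent-to-keyword map (not taken from the paper).
# Keywords are written in an assumed intermediate language.
INTENT_KEYWORDS = {
    "call_for_help": ["help", "emergency"],
    "turn_on_light": ["turn on the light", "lights on"],
}


@dataclass
class Segment:
    audio_id: str  # identifier of the mined audio clip
    text: str      # transcript or subtitle in the intermediate language


def mine_candidates(segments, intent_keywords=INTENT_KEYWORDS):
    """Return (audio_id, intent) pairs for segments matching exactly one intent."""
    mined = []
    for seg in segments:
        lowered = seg.text.lower()
        # Collect every intent that has at least one keyword in the text.
        matches = [
            intent
            for intent, keywords in intent_keywords.items()
            if any(kw in lowered for kw in keywords)
        ]
        # Ambiguous (several intents) or unmatched segments are dropped.
        if len(matches) == 1:
            mined.append((seg.audio_id, matches[0]))
    return mined


if __name__ == "__main__":
    demo = [
        Segment("a1", "Please help me"),
        Segment("a2", "Lights on, please"),
        Segment("a3", "Nice weather today"),
    ]
    for audio_id, intent in mine_candidates(demo):
        print(audio_id, intent)
```

Under these assumptions, the surviving pairs would be weak labels only. A second pass, such as the LLM pseudo-labeling the summary mentions, would still be needed to filter false matches.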