Pashto Common Voice: Building the First Open Speech Corpus for a 60-Million-Speaker Low-Resource Language
arXiv cs.CL / 2026/3/31
Key Points
- The paper introduces the Pashto Common Voice corpus, built within Mozilla Common Voice (MCV) and described as the first large-scale, openly licensed speech dataset for Pashto, a low-resource language with more than 60 million speakers that is largely missing from open speech technology.
- Across ten Mozilla Common Voice releases from 2022 to 2025 (CV14–CV23), the dataset grew from 1.5 hours with 5 contributors to 147 hours with 1,483 speakers; participation increased roughly 108× between CV17 and CV18 following a VOA Pashto broadcast campaign.
- The authors detail an end-to-end methodology including Pashto interface localization, Wikipedia-based sentence extraction with automated filtering, phoneme/character-targeted contributions for commonly dropped characters, and multi-channel outreach.
- MCV23 includes 107,781 clips (82.33 validated hours) across 13 content domains. Fine-tuning Whisper Base on MCV20 is reported to reach a word error rate (WER) of 13.4%, compared with 99.0% for zero-shot Whisper Base on Pashto.
- By combining community sourcing with targeted data collection and model fine-tuning results, the work provides a practical pathway to improve Pashto ASR performance using open speech resources.
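
For readers unfamiliar with the WER figures quoted above, the following is a minimal sketch of how word error rate is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function names `word_error_rate` and `relative_reduction` are illustrative, and the whitespace tokenization is an assumption; the summary does not say what text normalization or scoring tool the authors used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared verbatim,
    with no normalization. The paper's own scoring setup may differ.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction of the baseline error removed by the improved system."""
    return (baseline_wer - improved_wer) / baseline_wer


if __name__ == "__main__":
    # Placeholder tokens, not Pashto data: one substitution and one deletion
    # against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
    # Relative reduction implied by the two WER figures quoted in the summary.
    print(round(relative_reduction(99.0, 13.4), 3))
```

Because insertions count as errors, WER can exceed 100%, so a zero-shot score near 99% means the output is almost entirely wrong at the word level. Taken at face value, the two reported figures correspond to a relative WER reduction of about 86%, though the summary does not state whether both were measured on the same test split.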
