Pashto Common Voice: Building the First Open Speech Corpus for a 60-Million-Speaker Low-Resource Language
arXiv cs.CL / 3/31/2026
Key Points
- The paper introduces the Pashto Common Voice corpus, the Pashto portion of Mozilla Common Voice (MCV), described as the first large-scale, openly licensed speech dataset for Pashto, a low-resource language with more than 60 million speakers that is largely missing from open speech technology.
- Across ten Mozilla Common Voice releases (CV14–CV23) between 2022 and 2025, the dataset grew from 1.5 hours with 5 contributors to 147 hours with 1,483 speakers; participation rose roughly 108× between CV17 and CV18 after a VOA Pashto broadcast campaign.
- The authors detail an end-to-end methodology: localization of the Common Voice interface into Pashto, Wikipedia-based sentence extraction with automated filtering, phoneme- and character-targeted contributions to cover commonly dropped characters, and multi-channel outreach.
- MCV23 includes 107,781 clips (82.33 validated hours) across 13 content domains. Whisper Base fine-tuned on MCV20 is reported to reach a 13.4% word error rate (WER), versus 99.0% WER for zero-shot Whisper Base on Pashto.
- By combining community sourcing, targeted data collection, and model fine-tuning experiments, the work offers a practical pathway to better Pashto ASR built on open speech resources.
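
For readers unfamiliar with the WER figures quoted above, the sketch below shows the standard definition of word error rate: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a model hypothesis, divided by the number of reference words. It is a minimal illustration, not the paper's evaluation code; the function name `word_error_rate` and the whitespace tokenization are assumptions, and the paper may apply text normalization that is not shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly;
    no text normalization is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens: one substitution (b -> x) and one deletion (d)
# against a four-word reference.
print(word_error_rate("a b c d", "a x c"))
```

Because insertions count as errors, WER can exceed 100%. A value near 99%, as reported for zero-shot Whisper Base, means almost no reference word is recovered correctly.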