Pashto Common Voice: Building the First Open Speech Corpus for a 60-Million-Speaker Low-Resource Language

arXiv cs.CL / 2026/3/31

Key Points

The paper introduces the Pashto Common Voice corpus, released through Mozilla Common Voice (MCV) and described as the first large-scale, openly licensed speech dataset for Pashto, a low-resource language with more than 60 million native speakers that is largely absent from open speech technology.
Across ten Mozilla Common Voice releases (CV14–CV23) between 2022 and 2025, the corpus grew from 1.5 hours and 5 contributors to 147 total hours and 1,483 unique speakers. Speaker participation rose roughly 108-fold between CV17 and CV18, coinciding with a VOA Pashto broadcast campaign.
The authors describe an end-to-end methodology: localizing the Common Voice interface into Pashto, extracting sentences from Wikipedia with automated filtering, collecting phonemically targeted contributions for the four most frequently dropped Pashto characters, and running multi-channel community outreach.
MCV23 contains 107,781 clips, of which 60,337 are validated (82.33 validated hours), across 13 content domains. Fine-tuning Whisper Base on MCV20 is reported to reach 13.4% WER on the MCV20 test split, compared with a published zero-shot WER of 99.0% for Whisper Base on Pashto.
By combining community sourcing with targeted data collection and model fine-tuning results, the work provides a practical pathway to improve Pashto ASR performance using open speech resources.
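
For readers unfamiliar with the metric behind the 13.4% and 99.0% figures, word error rate (WER) is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal illustration, not the authors' evaluation code: the function name `word_error_rate` and the example sentences are hypothetical, and plain whitespace tokenization is an assumption, since this summary does not describe the paper's text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical English example: one substitution and one deletion
# against a six-word reference.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {example_wer:.1%}")
```

Because insertions count as errors, WER is not capped at 100%. A zero-shot figure of 99.0% therefore indicates output in which nearly every reference word is wrong or missing.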

Abstract

We present the Pashto Common Voice corpus -- the first large-scale, openly licensed speech resource for Pashto, a language with over 60 million native speakers largely absent from open speech technology. Through a community effort spanning 2022-2025, the corpus grew from 1.5 hours and 5 contributors to 147 total hours and 1,483 unique speakers across ten Mozilla Common Voice releases (CV14-CV23). Speaker participation increased approximately 108-fold between CV17 and CV18, coinciding with a VOA Pashto broadcast campaign. We describe the full methodology: interface localisation, Wikipedia-based sentence extraction with automated filtering, phonemically targeted contributions for the four most frequently dropped Pashto characters, and multi-channel community outreach. MCV23 contains 107,781 clips (60,337 validated; 82.33 validated hours) across 13 content domains. Fine-tuning Whisper Base on MCV20 yields 13.4% WER on the MCV20 test split, against the published Whisper Base zero-shot WER of 99.0% on Pashto.
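
A few derived quantities help put the abstract's numbers in context. The snippet below is a back-of-the-envelope sketch that uses only the figures quoted above; the variable names are illustrative, and the derived values are this summary's arithmetic rather than results reported by the authors.

```python
# Figures quoted in the abstract.
total_clips = 107_781
validated_clips = 60_337
validated_hours = 82.33
zero_shot_wer = 99.0    # percent, published Whisper Base zero-shot figure
fine_tuned_wer = 13.4   # percent, fine-tuned Whisper Base on the MCV20 test split
hours_first, hours_last = 1.5, 147.0
speakers_first, speakers_last = 5, 1_483

# Derived quantities (not reported in the abstract).
validated_share = validated_clips / total_clips
mean_validated_clip_s = validated_hours * 3600 / validated_clips
relative_wer_reduction = (zero_shot_wer - fine_tuned_wer) / zero_shot_wer
hours_growth = hours_last / hours_first
speaker_growth = speakers_last / speakers_first

print(f"validated share of clips:   {validated_share:.1%}")
print(f"mean validated clip length: {mean_validated_clip_s:.2f} s")
print(f"relative WER reduction:     {relative_wer_reduction:.1%}")
print(f"growth in total hours:      {hours_growth:.0f}x")
print(f"growth in speakers:         {speaker_growth:.1f}x")
```

On these figures, roughly 56% of MCV23 clips are validated, a validated clip averages about 4.9 seconds, and total hours grew about 98-fold over the ten releases. The relative WER reduction should be read as indicative only: the abstract does not state that the published zero-shot figure was measured on the MCV20 test split.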
