
PashtoCorp: A 1.25-Billion-Word Corpus, Evaluation Suite, and Reproducible Pipeline for Low-Resource Language Development

arXiv cs.CL / 3/18/2026


Key Points

  • PashtoCorp is a 1.25-billion-word Pashto corpus assembled from 39 sources spanning seven HuggingFace datasets and 32 custom web scrapers, making it the largest Pashto resource to date (40x the OSCAR Pashto subset, 83x the previously largest dedicated Pashto corpus).
  • It uses a reproducible pipeline with Arabic-script tokenization, SHA-256 deduplication, and quality filtering to ensure clean data for training and evaluation.
  • Pretraining XLM-R-base on PashtoCorp reduces held-out perplexity by 25.1% (8.08 → 6.06), indicating strong language modeling improvements.
  • On WikiANN Pashto NER, the model achieves a 10% relative F1 gain (19.0% → 21.0%), reduces training variance by about 7x, and shows Wikipedia is a critical source (removing it lowers F1 by 47%).
  • On Belebele Pashto reading comprehension, Gemma-3n reaches 64.6% accuracy, marking the first published Pashto LLM baseline for this benchmark; the data, model, and code are publicly available on HuggingFace and GitHub.
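The exact-deduplication step mentioned above can be sketched with a few lines of Python. This is a minimal illustration of SHA-256 content hashing, not the paper's actual pipeline code; the whitespace-normalization step before hashing is an assumption for illustration.

```python
import hashlib

def dedup_exact(documents):
    """Drop exact duplicate documents via SHA-256 content hashing (sketch)."""
    seen = set()
    unique = []
    for doc in documents:
        # Collapse whitespace so trivially reformatted copies hash identically
        # (an assumption; the paper's exact normalization may differ).
        key = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = ["پښتو متن", "پښتو  متن", "بل سند"]
print(len(dedup_exact(docs)))  # the two whitespace-variant copies collapse to one
```

Hashing a digest rather than storing full texts keeps the seen-set small, which matters at the corpus's scale of 2.81 million documents.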

Abstract

We present PashtoCorp, a 1.25-billion-word corpus for Pashto, a language spoken by 60 million people that remains severely underrepresented in NLP. The corpus is assembled from 39 sources spanning seven HuggingFace datasets and 32 purpose-built web scrapers, processed through a reproducible pipeline with Arabic-script tokenization, SHA-256 deduplication, and quality filtering. At 1.25B words across 2.81 million documents, PashtoCorp is 40x larger than the OSCAR Pashto subset and 83x larger than the previously largest dedicated Pashto corpus. Continued MLM pretraining of XLM-R-base on PashtoCorp reduces held-out perplexity by 25.1% (8.08 → 6.06). On WikiANN Pashto NER, the pretrained model improves entity F1 by 10% relative (19.0% → 21.0%) and reduces training variance nearly 7x; the largest gain appears at 50 training sentences (+27%), with PashtoCorp covering 97.9% of WikiANN entity vocabulary. On Belebele Pashto reading comprehension, Gemma-3n achieves 64.6% accuracy, the first published LLM baseline for Pashto on this benchmark. A leave-one-out source ablation shows that Wikipedia (0.7% of documents) is the most critical source for NER: removing it alone reduces entity F1 by 47%. Corpus data, trained model, and code are available at https://huggingface.co/datasets/ihanif/pashto-corpus, https://huggingface.co/ihanif/xlmr-pashto, and https://github.com/ihanif/pashto-corpus.
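The headline relative improvements follow directly from the reported raw numbers; a quick check makes the arithmetic explicit (the paper's 25.1% presumably comes from unrounded perplexity values).

```python
# Relative perplexity reduction from continued MLM pretraining.
ppl_before, ppl_after = 8.08, 6.06
ppl_reduction = (ppl_before - ppl_after) / ppl_before
print(f"{ppl_reduction:.1%}")  # 25.0% from the rounded figures (reported as 25.1%)

# Relative entity-F1 gain on WikiANN Pashto NER.
f1_before, f1_after = 19.0, 21.0
f1_gain = (f1_after - f1_before) / f1_before
print(f"{f1_gain:.1%}")  # 10.5%, reported as "10% relative"
```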