PashtoCorp: A 1.25-Billion-Word Corpus, Evaluation Suite, and Reproducible Pipeline for Low-Resource Language Development
arXiv cs.CL / 3/18/2026
📰 News · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research
Key Points
- PashtoCorp is a 1.25-billion-word Pashto corpus assembled from 39 sources spanning HuggingFace datasets and 32 custom web scrapers, making it the largest Pashto resource to date (40x OSCAR Pashto, 83x the previous largest).
- It is built with a reproducible pipeline that applies Arabic-script tokenization, SHA-256 exact deduplication, and quality filtering to produce clean data for training and evaluation (a deduplication sketch follows this list).
- Pretraining XLM-R-base on PashtoCorp cuts held-out perplexity by 25.1% (8.08 → 6.06), a substantial language-modeling improvement (the arithmetic is worked through below).
- On WikiANN Pashto NER, the model achieves a 10% relative F1 gain (19.0% → 21.0%), reduces training variance by about 7x, and shows Wikipedia is a critical source (removing it lowers F1 by 47%).
- On Belebele Pashto reading comprehension, Gemma-3n reaches 64.6% accuracy, the first published Pashto LLM baseline for this benchmark (a scoring sketch follows below); the data, model, and code are publicly available on HuggingFace and GitHub.
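The paper's own pipeline code isn't reproduced here, but SHA-256 exact deduplication is a standard technique and easy to sketch. The snippet below is a minimal illustration, assuming a simple normalization step before hashing; the `normalize` helper is hypothetical, not from the paper.

```python
import hashlib
import unicodedata

def normalize(text: str) -> str:
    # Hypothetical normalization: Unicode NFC plus whitespace collapsing,
    # so trivially different copies of a document hash identically.
    return " ".join(unicodedata.normalize("NFC", text).split())

def dedup(docs):
    """Yield each document the first time its SHA-256 digest is seen."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc

docs = ["سلام ورور", "سلام  ورور", "ښه راغلاست"]  # second is a whitespace near-duplicate
print(list(dedup(docs)))  # the duplicate is dropped
```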
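For reference, the 25.1% figure is simply the relative drop in held-out perplexity, and since perplexity is the exponential of mean cross-entropy, the reported values also pin down the implied per-token loss change:

```python
import math

ppl_before, ppl_after = 8.08, 6.06
rel_reduction = (ppl_before - ppl_after) / ppl_before
print(f"{rel_reduction:.1%}")  # ~25.0%, matching the reported 25.1% up to rounding
# Perplexity = exp(mean cross-entropy), so the implied loss drop is:
print(math.log(ppl_before), math.log(ppl_after))  # ~2.089 -> ~1.802 nats/token
```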
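The paper's evaluation harness isn't shown in this summary. As a minimal sketch of scoring multiple-choice accuracy on Belebele's Pashto split, the code below assumes the `facebook/belebele` dataset layout on HuggingFace; the config name `pbt_Arab` (FLORES-200 code for Pashto) and the field names are assumptions, and `ask_model` is a hypothetical stand-in for whatever Gemma-3n inference call the authors use.

```python
from datasets import load_dataset

def ask_model(passage: str, question: str, choices: list[str]) -> int:
    # Hypothetical placeholder for a Gemma-3n inference call; it should
    # return the 1-based index of the model's chosen answer.
    return 1  # dummy: always picks the first option

# Config name "pbt_Arab" is an assumption about the Belebele layout.
data = load_dataset("facebook/belebele", "pbt_Arab", split="test")

correct = 0
for row in data:
    choices = [row[f"mc_answer{i}"] for i in range(1, 5)]
    pred = ask_model(row["flores_passage"], row["question"], choices)
    correct += pred == int(row["correct_answer_num"])

print(f"accuracy: {correct / len(data):.1%}")  # the paper reports 64.6% for Gemma-3n
```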