PashtoCorp: A 1.25-Billion-Word Corpus, Evaluation Suite, and Reproducible Pipeline for Low-Resource Language Development
arXiv cs.CL / 3/18/2026
Key Points
- PashtoCorp is a 1.25-billion-word Pashto corpus assembled from 39 sources, including HuggingFace datasets and 32 custom web scrapers, making it the largest Pashto resource to date (40x the size of OSCAR Pashto and 83x the previous largest Pashto corpus).
- It uses a reproducible pipeline with Arabic-script tokenization, SHA-256 deduplication, and quality filtering to ensure clean data for training and evaluation.
- Pretraining XLM-R-base on PashtoCorp reduces held-out perplexity by 25.1% (8.08 → 6.06), indicating strong language modeling improvements.
- On WikiANN Pashto NER, the model achieves a 10% relative F1 gain (19.0% → 21.0%), reduces training variance by about 7x, and shows Wikipedia is a critical source (removing it lowers F1 by 47%).
- On Belebele Pashto reading comprehension, Gemma-3n reaches 64.6% accuracy, marking the first published Pashto LLM baseline for this benchmark; the data, model, and code are publicly available on HuggingFace and GitHub.
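The exact-duplicate removal step described above can be sketched in a few lines. This is a minimal illustration of SHA-256 deduplication over normalized documents, not the paper's exact recipe; the whitespace normalization applied before hashing is an assumption for the example.

```python
import hashlib

def dedup_exact(docs):
    """Drop exact duplicates by hashing each normalized document with SHA-256.

    Normalization here (whitespace collapsing) is illustrative only;
    the PashtoCorp pipeline's preprocessing may differ.
    """
    seen = set()
    unique = []
    for doc in docs:
        normalized = " ".join(doc.split())  # collapse runs of whitespace (assumption)
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

Hashing fixed-length digests instead of storing full documents keeps the seen-set small even at billion-word scale, which is why content-hash deduplication is a common choice for corpus pipelines.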