NepTam: A Nepali-Tamang Parallel Corpus and Baseline Machine Translation Experiments
arXiv cs.CL / 3/17/2026
Key Points
- NepTam20K provides a 20,000-sentence gold-standard Nepali-Tamang parallel corpus and NepTam80K provides an 80,000-sentence synthetic parallel corpus, both designed to support machine translation.
- Both datasets are sentence-aligned. They were built through a pipeline of scraping Nepali news and other online sources, preprocessing, semantic filtering, tense/polarity balancing (for NepTam20K), and expert translation verified by native Tamang linguists.
- The corpus covers five domains: Agriculture, Health, Education and Technology, Culture, and General Communication.
- Baseline translation experiments with a vanilla Transformer and multilingual models such as mBART, M2M-100, and NLLB-200 show that fine-tuned NLLB-200 achieves the highest sacreBLEU scores: 40.92 (Nepali-Tamang) and 45.26 (Tamang-Nepali).
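
For readers unfamiliar with the metric behind those scores, the following is a minimal sketch of corpus-level BLEU on a 0-100 scale. It is not the paper's evaluation code and not sacreBLEU itself: it assumes whitespace tokenization, one reference per sentence, and no smoothing, so its numbers will generally differ from sacreBLEU's. The function names `ngrams` and `corpus_bleu` and the toy sentences are illustrative only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU (0-100), one reference per hypothesis.

    Assumptions: whitespace tokenization and no smoothing. sacreBLEU
    applies its own tokenizer, so its scores will generally differ.
    """
    matched = [0] * max_n  # clipped n-gram matches, per n-gram order
    total = [0] * max_n    # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Counter intersection clips each count by the reference count.
            matched[n - 1] += sum((hyp_counts & ref_counts).values())
            total[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any order with zero matches makes BLEU zero.
    if hyp_len == 0 or min(matched) == 0:
        return 0.0

    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty: penalize output shorter than the reference.
    if hyp_len >= ref_len:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    # Toy placeholder tokens, not real corpus sentences.
    refs = ["a b c d e", "f g h i"]
    hyps = ["a b c d e", "f g h i"]
    print(corpus_bleu(hyps, refs))
```

The score is the brevity penalty multiplied by the geometric mean of the 1-gram to 4-gram precisions, which is why a system output that matches its references exactly scores 100 and one that shares no 4-grams with them scores 0 under this unsmoothed variant.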



