NepTam: A Nepali-Tamang Parallel Corpus and Baseline Machine Translation Experiments
arXiv cs.CL / 3/17/2026
Key Points
- NepTam20K provides a 20,000-sentence gold-standard Nepali-Tamang parallel corpus and NepTam80K provides an 80,000-sentence synthetic parallel corpus, both designed to support machine translation.
- The datasets are sentence-aligned and built through a pipeline including data scraping from Nepali news and online sources, preprocessing, semantic filtering, tense/polarity balancing (for NepTam20K), and expert translation with verification by native Tamang linguists.
- The corpus covers five domains: Agriculture, Health, Education and Technology, Culture, and General Communication.
- Baseline translation experiments with multilingual models such as mBART, M2M-100, and NLLB-200, alongside a vanilla Transformer, show that fine-tuned NLLB-200 achieves the highest sacreBLEU scores: 40.92 (Nepali-Tamang) and 45.26 (Tamang-Nepali).
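
The scores above are corpus-level BLEU as reported by sacreBLEU. To illustrate what such a number measures, here is a minimal sketch of corpus BLEU in plain Python. It assumes whitespace tokenization, a single reference per sentence, and no smoothing, so it will not reproduce sacreBLEU's figures exactly; the function names `ngram_counts` and `corpus_bleu` and the placeholder sentences are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU on a 0-100 scale.

    Assumptions: whitespace tokenization, one reference per hypothesis,
    no smoothing (any n-gram order with zero matches yields 0.0).
    """
    matches = [0] * max_n   # clipped n-gram matches, per order
    totals = [0] * max_n    # hypothesis n-grams, per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tokens, n)
            ref_counts = ngram_counts(ref_tokens, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(
                min(count, ref_counts[gram]) for gram, count in hyp_counts.items()
            )
            totals[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output shorter than the reference.
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


if __name__ == "__main__":
    # Placeholder token strings, not real corpus sentences.
    refs = ["a b c d e f", "g h i j k"]
    hyps = ["a b c d e x", "g h i j k"]
    print(round(corpus_bleu(hyps, refs), 2))
```

For comparable numbers in practice, the sacreBLEU tool itself should be used, since it fixes tokenization and reports a signature that makes scores reproducible across papers.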