Towards High-Quality Machine Translation for Kokborok: A Low-Resource Tibeto-Burman Language of Northeast India

arXiv cs.CL / 4/23/2026


Key Points

  • The study introduces KokborokMT, a neural machine translation system for Kokborok, a low-resource Tibeto-Burman language spoken in Tripura, India.
  • The authors fine-tune NLLB-200-distilled-600M using a multi-source parallel dataset totaling 36,052 sentence pairs, combining professional translations, Bible-domain data, and synthetic back-translations generated with Gemini Flash.
  • They add a dedicated Kokborok language token to the NLLB framework to better support the language in the model.
  • Evaluation shows the best model reaches BLEU scores of 17.30 and 38.56 on held-out test sets, with human assessments indicating solid adequacy (3.74/5) and fluency (3.70/5).
  • The system substantially outperforms earlier MT attempts, which were trained on small Bible-derived corpora and achieved BLEU scores below 7.
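The BLEU figures above come from standard MT evaluation tooling. As a rough illustration of what the metric measures, here is a minimal corpus-level BLEU-4 in pure Python: modified n-gram precision with clipping, a geometric mean over n = 1..4, and a brevity penalty. This is a simplified stand-in (single reference, no smoothing), not the sacreBLEU setup the authors presumably used; real evaluations should rely on standard tools.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with uniform weights, clipping, and brevity penalty.

    hypotheses, references: parallel lists of token lists (one reference
    per hypothesis). Returns a score in [0, 100].
    """
    clipped = [0] * max_n   # clipped n-gram matches, per order
    total = [0] * max_n     # total hypothesis n-grams, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngram_counts(hyp, n), ngram_counts(ref, n)
            clipped[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            total[n - 1] += max(len(hyp) - n + 1, 0)
    if min(clipped) == 0:   # any zero precision collapses the geometric mean
        return 0.0
    log_p = sum(math.log(clipped[n] / total[n]) for n in range(max_n)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_p)
```

For example, an exact match scores 100, while a hypothesis that shares only some n-grams with the reference lands in between — which is why a BLEU of 17.30 on open-domain text can still reflect a large practical improvement over sub-7 baselines.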

Abstract

We present KokborokMT, a high-quality neural machine translation (NMT) system for Kokborok (ISO 639-3), a Tibeto-Burman language spoken primarily in Tripura, India, by approximately 1.5 million speakers. Despite its status as an official language of Tripura, Kokborok has remained severely under-resourced in the NLP community; prior machine translation attempts were limited to systems trained on small Bible-derived corpora and achieved BLEU scores below 7. We fine-tune the NLLB-200-distilled-600M model on a multi-source parallel corpus comprising 36,052 sentence pairs: 9,284 professionally translated sentences from the SMOL dataset, 1,769 Bible-domain sentences from WMT shared task data, and 24,999 synthetic back-translated pairs generated via Gemini Flash from Tatoeba English source sentences. We introduce a dedicated language token for Kokborok in the NLLB framework. Our best system achieves BLEU scores of 17.30 and 38.56 on held-out test sets, representing substantial improvements over prior published results. Human evaluation by three annotators yields mean adequacy of 3.74/5 and fluency of 3.70/5, with substantial agreement between trained evaluators.
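The abstract reports "substantial agreement" between evaluators, a phrase conventionally tied to chance-corrected agreement statistics (under the common Landis & Koch reading, Cohen's kappa of 0.61–0.80 is "substantial"). As an illustration of the idea, here is a minimal Cohen's kappa for two annotators in pure Python; the paper does not state which agreement statistic it used, so this is an assumed example, not the authors' method.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa: chance-corrected agreement between two annotators
    who rated the same items (e.g. adequacy scores on a 1-5 scale)."""
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    # observed agreement: fraction of items where the two labels match
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # expected agreement by chance, from each annotator's label distribution
    ca, cb = Counter(ratings_a), Counter(ratings_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)
```

Perfect agreement yields kappa = 1.0, while agreement no better than chance yields kappa near 0; values above 0.6 would support the paper's "substantial agreement" claim.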