MERIT: Multilingual Expert-Reward Informed Tuning for Chinese-Centric Low-Resource Machine Translation

arXiv cs.CL / 4/7/2026


Key Points

  • The paper addresses the limited quality of Chinese-to-Southeast-Asian low-resource machine translation, where scarce clean parallel data and noisy mined corpora keep performance far behind high-resource directions.
  • It introduces MERIT, a unified framework that creates a Chinese-centric evaluation suite by adapting the ALT benchmark to five low-resource Southeast Asian languages.
  • MERIT combines language-specific token prefixing (LTP) with supervised fine-tuning (SFT) and group relative policy optimization (GRPO) driven by a semantic alignment reward (SAR); illustrative sketches of LTP and the SAR-driven GRPO step appear after this list and after the abstract, respectively.
  • The authors report that targeted data curation and reward-guided optimization substantially outperform relying on model scaling alone for LRL↔Chinese translation.
  • Overall, the work suggests that evaluation design and reward-informed training strategies can more effectively close the gap in low-resource bilingual translation quality.
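
To make the LTP component concrete, here is a minimal sketch assuming the common convention of prepending a target-language tag token to each source sentence before SFT. The paper's actual tag vocabulary is not given in this summary, so the `<2xx>` tag names below are assumptions, and only the three languages named in the abstract are filled in.

```python
# Hypothetical sketch of language-specific token prefixing (LTP).
# Tag names such as "<2lo>" are assumptions, not the paper's published vocabulary.

LANG_TAGS = {
    "lo": "<2lo>",  # Lao
    "my": "<2my>",  # Burmese
    "tl": "<2tl>",  # Tagalog
    # The remaining two of the five LRLs are not named in this summary.
}

def add_ltp(src_zh: str, tgt_lang: str) -> str:
    """Prepend the target-language tag so one checkpoint can be steered per direction."""
    return f"{LANG_TAGS[tgt_lang]} {src_zh}"

# One SFT training input, Chinese -> Lao:
print(add_ltp("今天天气很好。", "lo"))  # -> "<2lo> 今天天气很好。"
```

Tagging the source side (rather than forcing the tag as the first decoder token) is one common design choice; either placement lets a single model cover every Chinese↔LRL direction without separate per-language checkpoints.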

Abstract

Neural machine translation (NMT) from Chinese to low-resource Southeast Asian languages remains severely constrained by the extreme scarcity of clean parallel corpora and the pervasive noise in existing mined data. This chronic shortage not only impedes effective model training but also sustains a large performance gap with high-resource directions, leaving millions of speakers of languages such as Lao, Burmese, and Tagalog with persistently low-quality translation systems despite recent advances in large multilingual models. We introduce Multilingual Expert-Reward Informed Tuning (MERIT), a unified translation framework that transforms the traditional English-centric ALT benchmark into a Chinese-centric evaluation suite for five Southeast Asian low-resource languages (LRLs). Our framework combines language-specific token prefixing (LTP) with supervised fine-tuning (SFT) and a novel group relative policy optimization (GRPO) guided by a semantic alignment reward (SAR). Experimental results confirm that, in LRL→Chinese translation, targeted data curation and reward-guided optimization dramatically outperform mere model scaling.
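
As a rough illustration of how GRPO and SAR fit together, the sketch below standardizes rewards within a group of sampled translations for one prompt. The exact SAR definition is not specified in this summary, so scoring each hypothesis by embedding cosine similarity against the reference is an assumption, as are all function and variable names in the snippet.

```python
# Minimal sketch of GRPO-style group-relative advantages driven by a
# semantic alignment reward (SAR). The cosine-similarity reward is an
# assumed stand-in for the paper's actual SAR.
import numpy as np

def semantic_alignment_reward(hyp_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """Cosine similarity between hypothesis and reference sentence embeddings."""
    denom = np.linalg.norm(hyp_emb) * np.linalg.norm(ref_emb) + 1e-8
    return float(hyp_emb @ ref_emb / denom)

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages: standardize rewards within one prompt's group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: G = 4 sampled translations for one Chinese source sentence.
rng = np.random.default_rng(0)
ref = rng.normal(size=768)                   # reference embedding (stand-in)
hyps = [ref + rng.normal(scale=s, size=768)  # noisier hypotheses score lower
        for s in (0.1, 0.5, 1.0, 2.0)]
rewards = np.array([semantic_alignment_reward(h, ref) for h in hyps])
print(grpo_advantages(rewards))  # the most reference-aligned sample gets the largest advantage
```

In GRPO proper, these group-relative advantages then weight the policy-gradient update on each sampled translation's tokens, sidestepping the separate value network that PPO would require.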