Fluent Alignment with Disfluent Judges: Post-training for Lower-resource Languages

arXiv cs.CL / March 30, 2026

Key Points

  • The paper proposes a post-training method for lower-resource languages that maintains model fluency even when alignment is driven by disfluent reward models.
  • It addresses a key gap: many lower-resource languages lack both native-speaker instruction data and the instruction-tuned models needed to generate fluent synthetic training data.
  • The method uses on-policy training to build a fluency-preserving, preference-aligned language model without any instruction-tuning data in the target language (see the sketch after this list).
  • In a case study on Norwegian Bokmål, native-speaker evaluations indicate that the on-policy component is crucial: it outperforms both supervised fine-tuning on machine-translated data and multilingual fine-tuning.
  • The work frames fluency preservation as a key requirement for aligning language models in settings where high-quality preference data and fluent generators are hard to obtain.
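
The paper itself doesn't include code, so the following is only a minimal sketch of how on-policy preference pairs could be constructed under the setup the key points describe: sample k completions from the current policy and rank them with a (possibly disfluent) reward model, keeping the best and worst as a chosen/rejected pair. It assumes a Hugging Face transformers stack, and all model names are placeholders.

```python
# Hypothetical sketch of on-policy preference-pair construction.
# Model names are placeholders, not the paper's actual checkpoints.
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

POLICY_NAME = "base-model-placeholder"   # pretrained LM, no target-language SFT
REWARD_NAME = "reward-model-placeholder" # reward model, may itself be disfluent

tok = AutoTokenizer.from_pretrained(POLICY_NAME)
policy = AutoModelForCausalLM.from_pretrained(POLICY_NAME)
rm_tok = AutoTokenizer.from_pretrained(REWARD_NAME)
reward_model = AutoModelForSequenceClassification.from_pretrained(REWARD_NAME)


@torch.no_grad()
def reward(prompt: str, completion: str) -> float:
    """Score a prompt + completion with the reward model's scalar head."""
    inputs = rm_tok(prompt + completion, return_tensors="pt", truncation=True)
    return reward_model(**inputs).logits[0, 0].item()


@torch.no_grad()
def sample_preference_pair(prompt: str, k: int = 4) -> dict:
    """Sample k on-policy completions; keep the best and worst as a pair."""
    batch = tok(prompt, return_tensors="pt")
    out = policy.generate(
        **batch,
        do_sample=True,
        temperature=0.9,
        max_new_tokens=256,
        num_return_sequences=k,
        pad_token_id=tok.eos_token_id,
    )
    prompt_len = batch["input_ids"].shape[1]
    completions = [
        tok.decode(seq[prompt_len:], skip_special_tokens=True) for seq in out
    ]
    ranked = sorted(completions, key=lambda c: reward(prompt, c))
    return {"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]}
```

Because the candidates are drawn from the policy being trained rather than from a translator or a multilingual model, the pairs stay in-distribution, which is the property the paper argues matters for preserving fluency.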

Abstract

We propose a post-training method for lower-resource languages that preserves the fluency of language models even when aligned by disfluent reward models. Preference optimization is now a well-researched topic, but previous work has mostly addressed models for English and Chinese. Lower-resource languages lack both datasets written by native speakers and instruction-tuned language models capable of generating fluent synthetic data. To address this, we focus on developing a fluent preference-aligned language model without any instruction-tuning data in the target language. Our approach uses an on-policy training method, which we compare with two common alternatives: supervised fine-tuning on machine-translated data and multilingual fine-tuning. We conduct a case study on Norwegian Bokmål and evaluate fluency through native-speaker assessments. The results show that the on-policy aspect is crucial and outperforms the alternatives without relying on any hard-to-obtain data.
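
The abstract doesn't name the specific preference-optimization objective, so as one concrete possibility, here is a sketch of a DPO-style loss (Rafailov et al., 2023) that could consume such on-policy pairs. The `beta` value and the log-probability helper are illustrative assumptions, not the paper's published training code.

```python
# Illustrative DPO-style objective over on-policy preference pairs.
# This is an assumption for exposition; the paper's actual objective
# may differ.
import torch
import torch.nn.functional as F


def sequence_logprob(model, input_ids, labels):
    """Summed log-probability of `labels` under `model`.

    Prompt positions in `labels` should be masked with -100 so that
    only completion tokens contribute to the score.
    """
    logits = model(input_ids=input_ids).logits[:, :-1, :]
    labels = labels[:, 1:]
    mask = labels != -100
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = logps.gather(-1, labels.clamp(min=0).unsqueeze(-1)).squeeze(-1)
    return (token_logps * mask).sum(dim=-1)


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Push the policy's chosen-vs-rejected log-prob margin above the
    frozen reference model's margin, scaled by beta."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```

In this reading, the chosen and rejected completions would come from a sampling loop like the one sketched above, so the preference data remains on-policy throughout training, even though the reward signal itself may be disfluent.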