Fluent Alignment with Disfluent Judges: Post-training for Lower-resource Languages

arXiv cs.CL / March 30, 2026

Key Points

  • The paper proposes a post-training method for lower-resource languages that maintains model fluency even when alignment is driven by disfluent reward models.
  • It addresses a key gap: many lower-resource languages lack both native-speaker instruction data and the instruction-tuned models needed to generate fluent synthetic training data.
  • The method uses on-policy training to build a fluency-preserving, preference-aligned language model without any instruction-tuning data in the target language (see the sketch after this list).
  • In a case study on Norwegian Bokmål, native-speaker evaluations indicate that the on-policy component is crucial: it outperforms both supervised fine-tuning on machine-translated data and multilingual fine-tuning.
  • The work frames fluency preservation as a key requirement for aligning language models in settings where high-quality preference data and fluent generators are hard to obtain.
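
The paper itself doesn't include code, so the following is only a minimal sketch of how on-policy preference pairs could be constructed under the setup the key points describe: sample k completions from the current policy and rank them with a (possibly disfluent) reward model, keeping the best and worst as a chosen/rejected pair. It assumes a Hugging Face transformers stack, and all model names are placeholders.

```python
# Hypothetical sketch of on-policy preference-pair construction.
# Model names are placeholders, not the paper's actual checkpoints.
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

POLICY_NAME = "base-model-placeholder"   # pretrained LM, no target-language SFT
REWARD_NAME = "reward-model-placeholder" # reward model, may itself be disfluent

tok = AutoTokenizer.from_pretrained(POLICY_NAME)
policy = AutoModelForCausalLM.from_pretrained(POLICY_NAME)
rm_tok = AutoTokenizer.from_pretrained(REWARD_NAME)
reward_model = AutoModelForSequenceClassification.from_pretrained(REWARD_NAME)


@torch.no_grad()
def reward(prompt: str, completion: str) -> float:
    """Score a prompt + completion with the reward model's scalar head."""
    inputs = rm_tok(prompt + completion, return_tensors="pt", truncation=True)
    return reward_model(**inputs).logits[0, 0].item()


@torch.no_grad()
def sample_preference_pair(prompt: str, k: int = 4) -> dict:
    """Sample k on-policy completions; keep the best and worst as a pair."""
    batch = tok(prompt, return_tensors="pt")
    out = policy.generate(
        **batch,
        do_sample=True,
        temperature=0.9,
        max_new_tokens=256,
        num_return_sequences=k,
        pad_token_id=tok.eos_token_id,
    )
    prompt_len = batch["input_ids"].shape[1]
    completions = [
        tok.decode(seq[prompt_len:], skip_special_tokens=True) for seq in out
    ]
    ranked = sorted(completions, key=lambda c: reward(prompt, c))
    return {"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]}
```

Because the candidates are drawn from the policy being trained rather than from a translator or a multilingual model, the pairs stay in-distribution, which is the property the paper argues matters for preserving fluency.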

Abstract

We propose a post-training method for lower-resource languages that preserves the fluency of language models even when aligned by disfluent reward models. Preference optimization is now a well-researched topic, but previous work has mostly addressed models for English and Chinese. Lower-resource languages lack both datasets written by native speakers and instruction-tuned language models capable of generating fluent synthetic data. To address this, we focus on developing a fluent preference-aligned language model without any instruction-tuning data in the target language. Our approach uses an on-policy training method, which we compare with two common alternatives: supervised fine-tuning on machine-translated data and multilingual fine-tuning. We conduct a case study on Norwegian Bokmål and evaluate fluency through native-speaker assessments. The results show that the on-policy aspect is crucial and outperforms the alternatives without relying on any hard-to-obtain data.
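
The abstract doesn't name the specific preference-optimization objective, so as one concrete possibility, here is a sketch of a DPO-style loss (Rafailov et al., 2023) that could consume such on-policy pairs. The `beta` value and the log-probability helper are illustrative assumptions, not the paper's published training code.

```python
# Illustrative DPO-style objective over on-policy preference pairs.
# This is an assumption for exposition; the paper's actual objective
# may differ.
import torch
import torch.nn.functional as F


def sequence_logprob(model, input_ids, labels):
    """Summed log-probability of `labels` under `model`.

    Prompt positions in `labels` should be masked with -100 so that
    only completion tokens contribute to the score.
    """
    logits = model(input_ids=input_ids).logits[:, :-1, :]
    labels = labels[:, 1:]
    mask = labels != -100
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = logps.gather(-1, labels.clamp(min=0).unsqueeze(-1)).squeeze(-1)
    return (token_logps * mask).sum(dim=-1)


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Push the policy's chosen-vs-rejected log-prob margin above the
    frozen reference model's margin, scaled by beta."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```

In this reading, the chosen and rejected completions would come from a sampling loop like the one sketched above, so the preference data remains on-policy throughout training, even though the reward signal itself may be disfluent.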