TLPO: Token-Level Policy Optimization for Mitigating Language Confusion in Large Language Models

arXiv cs.AI / 4/30/2026

Key Points

  • The paper addresses “language confusion” in large language models, where multilingual LLMs sometimes fail to consistently produce output in the intended language.
  • It argues that existing sequence-level fine-tuning methods (e.g., DPO, ORPO, GRPO) can cause unintended degradation of general model capabilities because they optimize whole responses.
  • The authors propose Token-Level Policy Optimization (TLPO), a fine-tuning framework that performs localized, token-level updates at error-prone positions.
  • TLPO searches over alternative candidate tokens at those positions and applies a tailored objective that suppresses language-confusion-inducing tokens at a granular level (a hedged sketch follows this list).
  • Experiments on multiple multilingual LLMs across diverse languages show that TLPO improves language consistency significantly more than baseline approaches while preserving downstream task performance.
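
To make the first two mechanism bullets concrete, here is a minimal sketch in Python/PyTorch of what identifying error-prone positions and exploring candidate tokens could look like. The Unicode-script heuristic, the top-k candidate filter, and every function name below are illustrative assumptions, not the authors' implementation.

```python
import unicodedata
import torch

def is_latin(token: str) -> bool:
    """Crude script check: every alphabetic character has a LATIN Unicode name."""
    letters = [c for c in token if c.isalpha()]
    return bool(letters) and all("LATIN" in unicodedata.name(c, "") for c in letters)

def confused_positions(decoded_tokens, target_is_latin):
    """Flag positions whose script disagrees with the intended output language."""
    return [i for i, tok in enumerate(decoded_tokens)
            if any(c.isalpha() for c in tok) and is_latin(tok) != target_is_latin]

def candidate_tokens(logits_at_pos, tokenizer, target_is_latin, k=20):
    """Explore the model's own top-k alternatives at an error-prone position,
    keeping only candidates written in the intended script (a crude proxy;
    assumes a Hugging Face-style tokenizer with a decode() method)."""
    topk = torch.topk(logits_at_pos, k)
    return [tid.item() for tid in topk.indices
            if is_latin(tokenizer.decode([tid.item()])) == target_is_latin]
```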

Abstract

Large language models (LLMs) demonstrate strong multilingual capabilities, yet often fail to consistently generate responses in the intended language, exhibiting a phenomenon known as language confusion. Prior mitigation approaches based on sequence-level fine-tuning, such as DPO, ORPO, and GRPO, operate at the level of entire responses and can lead to unintended degradation of general model capabilities, motivating the need for more fine-grained alternatives. To address this, we introduce Token-Level Policy Optimization (TLPO), a fine-tuning framework designed to mitigate language confusion through localized, token-level updates. TLPO identifies error-prone positions, explores alternative candidate tokens, and updates the policy using a tailored objective to suppress error-inducing outputs at a granular level. This selective intervention enables effective mitigation of language confusion without compromising the model's general abilities. Experiments on multiple multilingual LLMs across diverse languages demonstrate that TLPO significantly outperforms baselines in improving language consistency while preserving downstream task accuracy.
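
For the "tailored objective" the abstract mentions, one plausible reading is a DPO-style contrastive loss applied at a single flagged position, measured against a frozen reference model so untouched positions keep their behavior. The sketch below is a guess under that assumption; `beta`, the good/bad token pairing, and the sigmoid-margin form are all hypothetical and not confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def token_level_loss(policy_logits, ref_logits, bad_token, good_token, beta=0.1):
    """Contrastive update at one position: shift probability mass from the
    language-confused token toward an in-language candidate, measured
    relative to a frozen reference model to limit collateral drift."""
    logp = F.log_softmax(policy_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    good = logp[good_token] - ref_logp[good_token]  # policy/reference log-ratio
    bad = logp[bad_token] - ref_logp[bad_token]
    return -F.logsigmoid(beta * (good - bad))       # reward the margin

# Toy usage: one vocabulary distribution at a single flagged position.
vocab = 32000
policy_logits = torch.randn(vocab, requires_grad=True)
ref_logits = torch.randn(vocab)
loss = token_level_loss(policy_logits, ref_logits, bad_token=17, good_token=42)
loss.backward()  # gradient touches only this position's distribution
```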