TLPO: Token-Level Policy Optimization for Mitigating Language Confusion in Large Language Models
arXiv cs.AI / April 30, 2026
Key Points
- The paper addresses “language confusion” in large language models, where multilingual LLMs sometimes fail to consistently produce output in the intended language.
- It argues that existing sequence-level fine-tuning methods (e.g., DPO, ORPO, GRPO) optimize over whole responses, so fixing language errors can inadvertently degrade general model capabilities.
- The authors propose Token-Level Policy Optimization (TLPO), a fine-tuning framework that performs localized, token-level updates at error-prone positions.
- TLPO searches over candidate tokens at those positions and applies a tailored objective that suppresses confusion-inducing tokens while preserving overall downstream task accuracy (see the sketch after this list).
- Experiments on multiple multilingual LLMs and languages show that TLPO improves language consistency substantially more than the baseline approaches, without harming performance on downstream tasks.
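
The summary does not reproduce the paper's objective, so the following is a minimal PyTorch sketch of what a localized, token-level preference update could look like: a DPO-style logistic loss applied only at flagged error-prone positions, anchored against a frozen reference model. Every name here (`tlpo_token_loss`, `positions`, `good_ids`, `bad_ids`, `beta`) is an illustrative assumption, not the authors' API.

```python
import torch
import torch.nn.functional as F

def tlpo_token_loss(logits, ref_logits, positions, good_ids, bad_ids, beta=0.1):
    """Hypothetical per-token preference loss (not the paper's exact objective).

    logits, ref_logits: [batch, seq, vocab] tensors from the policy being tuned
    and a frozen reference model. positions: 1-D tensor of flattened
    (batch * seq) indices of error-prone tokens. good_ids / bad_ids: the
    intended-language token vs. the confusion-inducing candidate at each
    flagged position.
    """
    B, T, V = logits.shape
    logp = F.log_softmax(logits.reshape(B * T, V), dim=-1)
    # Detach the reference: it only anchors the update, it is never trained.
    ref_logp = F.log_softmax(ref_logits.reshape(B * T, V), dim=-1).detach()

    # Policy-vs-reference log-ratios for the preferred and dispreferred
    # tokens, evaluated only at the flagged positions.
    good = logp[positions, good_ids] - ref_logp[positions, good_ids]
    bad = logp[positions, bad_ids] - ref_logp[positions, bad_ids]

    # DPO-style logistic loss per flagged token: push the intended-language
    # token up relative to the confusing one. Unflagged positions receive
    # no gradient, which is what keeps the update localized.
    return -F.logsigmoid(beta * (good - bad)).mean()

if __name__ == "__main__":
    # Toy usage with random tensors: two flagged positions out of 2 x 8 tokens.
    B, T, V = 2, 8, 100
    logits = torch.randn(B, T, V, requires_grad=True)
    ref_logits = torch.randn(B, T, V)
    positions = torch.tensor([3, 11])   # flattened (batch * seq) indices
    good_ids = torch.tensor([7, 42])    # intended-language tokens
    bad_ids = torch.tensor([9, 5])      # confusion-inducing candidates
    tlpo_token_loss(logits, ref_logits, positions, good_ids, bad_ids).backward()
```

Restricting the loss to flagged positions is what would make such an update "token-level"; anchoring against a frozen reference model is the standard preference-optimization device for limiting drift in general capabilities, which is the degradation the paper attributes to sequence-level methods.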