Round-Trip Translation Reveals What Frontier Multilingual Benchmarks Miss
arXiv cs.CL, April 15, 2026
Key Points
- The paper argues that current frontier multilingual benchmarks often end up measuring mathematical reasoning and factual recall rather than true multilingual proficiency.
- It reports that "thinking" model variants can score much higher than "instruct" variants on these structured multilingual evaluations, while performing worse in real-world multilingual use as measured by platforms like LMArena.
- To better assess multilingual capability, the authors propose round-trip translation (source → target → back to source) and use semantic gaps between the original and the final text as an error signal.
- The approach is shown to correlate strongly (Spearman ρ = 0.94) with user ratings on LMArena, while requiring no human reference translations and avoiding reliance on a stronger multilingual judge model.
- The authors release a new benchmark, Lost in Translation (LiT), designed to stress multilingual generation across widely spoken languages.
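The round-trip idea described above can be sketched in a few lines. Everything here is an illustrative assumption, not the paper's actual implementation: `translate` and `embed` are hypothetical stand-ins for a translation model and a sentence-embedding model, and the Spearman correlation is computed from scratch for self-containment.

```python
import math

def spearman_rho(xs, ys):
    """Spearman rank correlation (assumes no ties, for brevity)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def round_trip_score(translate, embed, source, src_lang, tgt_lang):
    """Source -> target -> source; semantic similarity of the
    original and the back-translation is the quality signal
    (higher similarity = smaller semantic gap)."""
    forward = translate(source, src_lang, tgt_lang)
    back = translate(forward, tgt_lang, src_lang)
    return cosine(embed(source), embed(back))
```

A model's benchmark score would then be its mean `round_trip_score` over a corpus, and that score can be validated against human preference data with `spearman_rho`, mirroring the paper's reported correlation with LMArena ratings.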

