Should We be Pedantic About Reasoning Errors in Machine Translation?

arXiv cs.CL · April 14, 2026


Key Points

  • The paper examines whether machine translation outputs contain “reasoning errors” across many language pairs and defines three misalignment categories: source-misaligned, hypothesis-misaligned, and reasoning-trace-misaligned steps.
  • It uses an automated reasoning-annotation protocol to quantify the frequency of these errors, then applies weak-to-strong trace interventions (hedging, removal, re-reasoning, hindsight, and oracle) to test whether correcting reasoning improves translation.
  • Results indicate that small reasoning corrections have little impact on translation quality, while stronger interventions achieve higher resolution rates but produce mixed gains in translation quality.
  • The authors find reasoning-error identification precision varies by language (high in Urdu, lower in Spanish), and removing reasoning errors does not substantially eliminate the original errors, pointing to limited reasoning faithfulness in MT.
  • The study raises questions about how much explicit “reasoning trace” explanations in MT correspond to the true mechanisms producing correct outputs.
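The three misalignment categories can be thought of as entailment checks against three different premises. The sketch below is a minimal illustration of that idea, not the paper's implementation: the `entails(premise, claim)` judge is a hypothetical stand-in for whatever automated check the annotation protocol actually uses (e.g. an LLM annotator), and the check order is simply the order in which the categories are listed.

```python
from enum import Enum

class Misalignment(Enum):
    """The paper's three reasoning-error categories, plus an aligned label."""
    SOURCE = "source-misaligned"
    HYPOTHESIS = "hypothesis-misaligned"
    TRACE = "reasoning-trace-misaligned"
    ALIGNED = "aligned"

def classify_step(step, source, hypothesis, prior_trace, entails):
    """Label one reasoning step with the first misalignment it exhibits.

    `entails(premise, claim)` is a hypothetical judge function; in practice
    the protocol would use a model-based check rather than this interface.
    """
    if not entails(source, step):       # step contradicts the source sentence
        return Misalignment.SOURCE
    if not entails(hypothesis, step):   # step contradicts the model's translation
        return Misalignment.HYPOTHESIS
    if not entails(prior_trace, step):  # step contradicts earlier reasoning
        return Misalignment.TRACE
    return Misalignment.ALIGNED
```

With a toy `entails` (e.g. substring containment), a step unsupported by the source is labeled `SOURCE` before the other checks run, mirroring the category ordering above.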

Abstract

Across multiple language pairs (English → {Spanish, French, German, Mandarin, Japanese, Urdu, Cantonese}), we find reasoning errors in translation. To quantify how often these reasoning errors occur, we leverage an automated annotation protocol for reasoning evaluation wherein the goal is to detect whether a reasoning step falls into any of three error categories: (1) source sentence-misaligned, (2) model hypothesis-misaligned, or (3) reasoning trace-misaligned. We probe the reasoning model with perturbed traces that correct for these identified reasoning errors, using an array of weak-to-strong interventions: hedging, removal, re-reasoning after removal, hindsight, and oracle interventions. Experimenting with interventions on the reasoning traces suggests that small corrections to the reasoning have little impact on translation quality, while stronger interventions yield the highest resolution rates, despite translation quality gains being mixed. Ultimately, we find that reasoning errors in MT can be identified with high precision in Urdu but with lower precision in Spanish, and that removing these reasoning errors does not significantly resolve the initial errors, suggesting limited reasoning faithfulness for machine translation.
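The weak-to-strong intervention ladder can be sketched as a single trace-editing function. The edit semantics below are illustrative guesses inferred from the intervention names, not the paper's actual procedure; `regenerate` (re-running the model from the truncated trace) and `oracle_step` (a known-good replacement step) are hypothetical hooks.

```python
def intervene(trace, bad_idx, mode, regenerate=None, oracle_step=None):
    """Apply one intervention from a weak-to-strong ladder to a reasoning trace.

    `trace` is a list of reasoning-step strings and `bad_idx` indexes the
    step flagged as erroneous. The concrete edits are assumptions made for
    illustration only.
    """
    steps = list(trace)
    if mode == "hedging":            # weakest: soften the faulty step in place
        steps[bad_idx] = "Perhaps: " + steps[bad_idx]
    elif mode == "removal":          # drop the faulty step outright
        del steps[bad_idx]
    elif mode == "re-reasoning":     # drop it, then let the model continue
        del steps[bad_idx]
        if regenerate is not None:   # hypothetical hook: model resumes from here
            steps = regenerate(steps)
    elif mode == "hindsight":        # flag the error after the fact
        steps.insert(bad_idx + 1, "On reflection, the previous step was wrong.")
    elif mode == "oracle":           # strongest: substitute a known-good step
        steps[bad_idx] = oracle_step
    else:
        raise ValueError(f"unknown intervention: {mode}")
    return steps
```

For example, `intervene(["a", "bad", "c"], 1, "removal")` yields `["a", "c"]`, while the oracle mode swaps in the supplied gold step; the ladder's strength corresponds to how much external information each mode injects.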