Language as a Latent Variable for Reasoning Optimization

arXiv cs.CL / 4/24/2026


Key Points

  • The paper argues that language in LLMs acts as a latent variable influencing internal reasoning pathways, not just as an output formatting medium.
  • In a “Polyglot Thinking Experiment,” models solve identical problems under language-constrained and language-unconstrained prompting; accuracy is often higher when the language is unconstrained, and non-English outputs frequently show higher reasoning accuracy than English ones.
  • It introduces polyGRPO, an RL optimization framework that uses language variation as an implicit exploration signal and generates polyglot preference data online for improved reasoning.
  • Training polyGRPO on only 18.1K multilingual math problems (without chain-of-thought annotations) yields sizable accuracy gains for Qwen2.5-7B-Instruct across both English reasoning test sets and multilingual benchmarks.
  • The approach also reportedly surpasses the base model on an English commonsense reasoning task despite being trained only on math data, suggesting strong cross-task generalization driven by expanded latent reasoning space.

Abstract

As LLMs reduce English-centric bias, a surprising trend emerges: non-English responses sometimes outperform English ones on reasoning tasks. We hypothesize that language functions as a latent variable that structurally modulates the model's internal inference pathways, rather than merely serving as an output medium. To test this, we conducted a Polyglot Thinking Experiment, in which models were prompted to solve identical problems under language-constrained and language-unconstrained conditions. Results show that non-English responses often achieve higher accuracy, and the best performance frequently occurs when language is unconstrained, suggesting that multilinguality broadens the model's latent reasoning space. Based on this insight, we propose polyGRPO (Polyglot Group Relative Policy Optimization), an RL framework that treats language variation as an implicit exploration signal. It generates polyglot preference data online under language-constrained and unconstrained conditions, optimizing the policy with respect to both answer accuracy and reasoning structure. Trained on only 18.1K multilingual math problems without chain-of-thought annotations, polyGRPO improves the base model (Qwen2.5-7B-Instruct) by 6.72% absolute accuracy on four English reasoning test sets and by 6.89% on their multilingual counterparts. Remarkably, it is the only method that surpasses the base LLM on an English commonsense reasoning task (+4.9%), despite being trained solely on math data, highlighting its strong cross-task generalization. Further analysis reveals that treating language as a latent variable expands the model's latent reasoning space, yielding consistent and generalizable improvements in reasoning performance.