Improving Latent Generalization Using Test-time Compute

arXiv cs.LG / 4/3/2026


Key Points

  • The paper argues that failures of deductive reasoning in language models stem from weak “latent generalization” in weight-based knowledge acquisition (in-weights learning).
  • It proposes improving latent generalization without task-specific train-time augmentation, instead training models, via reinforcement learning from correctness feedback, to "think" at test time by producing long chains of thought (CoTs).
  • Experiments show the test-time thinking approach fixes many latent generalization failures on in-distribution knowledge and, unlike augmentation baselines, can generalize to new out-of-distribution knowledge where no RL was performed.
  • On pure reversal tasks, the method does not unlock direct knowledge inversion, but thinking models exploit a generate-and-verify strategy to perform well above chance.
  • The authors find that factual self-verification remains brittle, so thinking models still fall well below in-context learning on reversal tasks; even so, they position test-time thinking as a flexible direction for improving generalization.
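The generate-and-verify behavior described above can be sketched as a simple loop. This is a minimal illustration, not the paper's implementation: `propose` and `forward_check` are hypothetical stand-ins for sampling candidate answers from the model and checking a fact in the direction the model memorized it.

```python
def generate_and_verify(question, propose, forward_check, n_candidates=8):
    """Reversal via generate-and-verify: rather than inverting a memorized
    fact directly, propose candidate answers and accept one whose
    forward-direction statement the model can verify."""
    for _ in range(n_candidates):
        candidate = propose(question)          # e.g. sample a name from the model
        if forward_check(candidate, question): # verify in the memorized direction
            return candidate
    return None  # no candidate verified; caller falls back to guessing


# Toy illustration: a dictionary stands in for in-weights knowledge,
# stored only in the forward (parent -> child) direction.
facts = {"Mary Lee Pfeiffer": "Tom Cruise"}

def propose(question, _pool=iter(["Brad Pitt", "Mary Lee Pfeiffer"])):
    return next(_pool)  # deterministic stand-in for sampling from the model

def forward_check(candidate, question):
    # Reversal query "Who is Tom Cruise's mother?" is checked via the
    # forward-direction fact the model actually stores.
    return facts.get(candidate) == "Tom Cruise"

result = generate_and_verify("Who is Tom Cruise's mother?", propose, forward_check)
print(result)  # the second candidate verifies against the forward fact
```

The point of the sketch is that verification only ever queries knowledge in the direction it was learned, which is why brittle self-verification caps performance below in-context learning.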

Abstract

Language Models (LMs) exhibit two distinct mechanisms for knowledge acquisition: in-weights learning (i.e., encoding information within the model weights) and in-context learning (ICL). Although these two modes offer complementary strengths, in-weights learning frequently struggles to facilitate deductive reasoning over the internalized knowledge. We characterize this limitation as a deficit in latent generalization, of which the reversal curse is one example. Conversely, in-context learning demonstrates highly robust latent generalization capabilities. To improve latent generalization from in-weights knowledge, prior approaches rely on train-time data augmentation, yet these techniques are task-specific, scale poorly, and fail to generalize to out-of-distribution knowledge. To overcome these shortcomings, this work studies how models can be taught to use test-time compute, or 'thinking', specifically to improve latent generalization. We use Reinforcement Learning (RL) from correctness feedback to train models to produce long chains-of-thought (CoTs) to improve latent generalization. Our experiments show that this thinking approach not only resolves many instances of latent generalization failures on in-distribution knowledge but also, unlike augmentation baselines, generalizes to new knowledge for which no RL was performed. Nevertheless, on pure reversal tasks, we find that thinking does not unlock direct knowledge inversion, but the generate-and-verify ability of thinking models enables them to get well above chance performance. The brittleness of factual self-verification means thinking models still remain well below the performance of in-context learning for this task. Overall, our results establish test-time thinking as a flexible and promising direction for improving the latent generalization of LMs.
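The training signal the abstract describes, RL from correctness feedback over sampled chains of thought, reduces to a binary reward on the final answer. Below is a minimal REINFORCE-style sketch; the `Answer:` extraction format and the batch-mean baseline are assumptions for illustration, not the paper's actual setup.

```python
def correctness_reward(completion: str, gold: str) -> float:
    """Binary correctness feedback: 1.0 if the text after the final
    'Answer:' marker matches the gold label, else 0.0. The marker
    convention is an assumption of this sketch."""
    answer = completion.rsplit("Answer:", 1)[-1].strip()
    return 1.0 if answer.lower() == gold.lower() else 0.0


def reinforce_step(samples, gold):
    """One REINFORCE-style scoring step over a batch of sampled long CoTs:
    score each completion and return (completion, advantage) pairs, using
    the batch-mean reward as a simple baseline."""
    rewards = [correctness_reward(s, gold) for s in samples]
    baseline = sum(rewards) / len(rewards)
    return [(s, r - baseline) for s, r in zip(samples, rewards)]


batch = [
    "B is A's child, so A must be B's parent. Answer: Alice",
    "The relation should hold in reverse too. Answer: Bob",
]
scored = reinforce_step(batch, gold="Alice")
print(scored[0][1], scored[1][1])  # advantages: 0.5 -0.5
```

Only the final answer is rewarded, so the model is free to discover whatever intermediate reasoning (including generate-and-verify) raises correctness.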