TEMPO: Scaling Test-time Training for Large Reasoning Models

arXiv cs.LG / April 22, 2026


Key Points

  • The paper studies test-time training (TTT) for large reasoning models and finds that existing methods quickly plateau because their self-generated reward signal drifts as the policy changes at inference.
  • It proposes TEMPO, which alternates between refining the policy on unlabeled test questions and periodically recalibrating a critic using a labeled dataset.
  • The authors show that this alternating procedure can be formalized with the Expectation-Maximization (EM) algorithm, revealing earlier approaches as incomplete variants that skip the key critic recalibration step.
  • Reintroducing critic recalibration improves the evidence lower bound (ELBO) and enables sustained gains even when more test-time compute is available.
  • Experiments across model families and reasoning benchmarks report large accuracy jumps (e.g., OLMO3-7B on AIME 2024 from 33.0% to 51.1%, Qwen3-14B from 42.3% to 65.8%) while preserving high output diversity.
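The alternating procedure described above can be sketched as a toy loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `ToyPolicy`, `ToyCritic`, and the drift/recalibration mechanics are invented stand-ins that only mimic the structure of interleaving policy refinement with periodic critic recalibration.

```python
class ToyPolicy:
    """Hypothetical stand-in for an LRM policy; a scalar 'skill' is nudged by reward."""
    def __init__(self):
        self.skill = 0.0

    def generate(self, question):
        return question + self.skill  # toy "answer"

    def update(self, rewards):
        # Self-training analogue: move the policy in the direction of reward.
        self.skill += 0.1 * sum(rewards) / max(len(rewards), 1)


class ToyCritic:
    """Hypothetical critic whose reward signal drifts unless recalibrated."""
    def __init__(self):
        self.bias = 0.0

    def score(self, answer):
        self.bias += 0.05          # drift accumulates as the policy evolves
        return answer - self.bias

    def recalibrate(self, labeled_set):
        self.bias = 0.0            # labeled data anchors the reward signal


def tempo_loop(policy, critic, questions, labeled_set,
               steps=8, recalibrate_every=4):
    """Alternate policy refinement on unlabeled questions with periodic
    critic recalibration on labeled data (the step prior methods omit)."""
    recal_steps = []
    for step in range(steps):
        # Policy refinement: answer unlabeled test questions, score them
        # with the current critic, and update the policy on the rewards.
        answers = [policy.generate(q) for q in questions]
        rewards = [critic.score(a) for a in answers]
        policy.update(rewards)

        # Periodic critic recalibration: without it, drift compounds and
        # the self-generated reward signal degrades.
        if (step + 1) % recalibrate_every == 0:
            critic.recalibrate(labeled_set)
            recal_steps.append(step + 1)
    return recal_steps
```

Dropping the `recalibrate` call reduces the loop to the incomplete variant the paper attributes to prior TTT methods: the critic's bias grows monotonically and rewards become increasingly miscalibrated.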

Abstract

Test-time training (TTT) adapts model parameters on unlabeled test instances at inference, continuously extending capabilities beyond the reach of offline training. Despite initial gains, existing TTT methods for large reasoning models (LRMs) plateau quickly and do not benefit from additional test-time compute. Without external calibration, the self-generated reward signal increasingly drifts as the policy model evolves, leading to both performance plateaus and diversity collapse. We propose TEMPO, a TTT framework that interleaves policy refinement on unlabeled questions with periodic critic recalibration on a labeled dataset. By formalizing this alternating procedure through the Expectation-Maximization (EM) algorithm, we reveal that prior methods can be interpreted as incomplete variants that omit the crucial recalibration step. Reintroducing this step tightens the evidence lower bound (ELBO) and enables sustained improvement. Across diverse model families (Qwen3 and OLMO3) and reasoning tasks, TEMPO improves OLMO3-7B on AIME 2024 from 33.0% to 51.1% and Qwen3-14B from 42.3% to 65.8%, while maintaining high diversity.