
Test-Time Scaling Makes Overtraining Compute-Optimal

arXiv cs.LG / 2026-04-03


Key Points

  • The paper introduces Train-to-Test (T^2) scaling laws to jointly optimize LLM size, training compute (tokens), and test-time inference sampling under a fixed end-to-end budget, addressing a trade-off ignored by classic pretraining scaling laws like Chinchilla.
  • T^2 modernizes pretraining scaling laws with the pass@k modeling used for test-time scaling; its forecasts are robust across two distinct modeling approaches, one based on task loss and one on task accuracy.
  • Across eight downstream tasks, the authors find that when inference cost is included, the compute-optimal training strategy shifts dramatically into an overtraining regime beyond typical pretraining scaling suites.
  • They empirically validate this by pretraining heavily overtrained models located in the T^2-predicted optimal region, showing substantially stronger performance than using standard pretraining scaling alone.
  • The study further shows that the benefits persist after post-training (relevant to frontier LLM pipelines), suggesting T^2 is applicable to modern deployment settings where test-time scaling is common.

Abstract

Modern LLMs scale at test-time, e.g. via repeated sampling, where inference cost grows with model size and the number of samples. This creates a trade-off that pretraining scaling laws, such as Chinchilla, do not address. We present Train-to-Test (T^2) scaling laws that jointly optimize model size, training tokens, and number of inference samples under fixed end-to-end budgets. T^2 modernizes pretraining scaling laws with pass@k modeling used for test-time scaling, then jointly optimizes pretraining and test-time decisions. Forecasts from T^2 are robust across distinct modeling approaches: measuring the joint scaling effect on task loss and modeling the impact on task accuracy. Across eight downstream tasks, we find that when accounting for inference cost, optimal pretraining decisions shift radically into the overtraining regime, well outside the range of standard pretraining scaling suites. We validate our results by pretraining heavily overtrained models in the optimal region that T^2 scaling forecasts, confirming their substantially stronger performance compared to pretraining scaling alone. Finally, as frontier LLMs are post-trained, we show that our findings survive the post-training stage, making T^2 scaling meaningful in modern deployments.
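To make the trade-off concrete, the joint optimization described above can be sketched as a grid search over model size, training tokens, and samples per query under one end-to-end budget. This is a minimal illustrative sketch, not the paper's method: the loss-law constants, the loss-to-accuracy link function, and the cost model (the standard ~6ND training-FLOPs and ~2N FLOPs-per-generated-token approximations) are all assumptions chosen for readability.

```python
import math

# Chinchilla-style loss law. Constants are illustrative assumptions,
# not the paper's fitted values.
def loss(n_params: float, n_tokens: float) -> float:
    return 1.69 + 406.4 / n_params**0.34 + 410.7 / n_tokens**0.28

# Assumed monotone link from loss to single-sample success probability,
# then pass@k: the chance that at least one of k samples succeeds.
def pass_at_k(n_params: float, n_tokens: float, k: int) -> float:
    p = math.exp(-loss(n_params, n_tokens))
    return 1.0 - (1.0 - p) ** k

# End-to-end cost: ~6*N*D training FLOPs plus ~2*N FLOPs per generated
# token, for k samples over Q queries of T tokens each.
def total_cost(n_params, n_tokens, k, queries=1e6, tokens_per_query=1024):
    return 6 * n_params * n_tokens + 2 * n_params * k * queries * tokens_per_query

def best_config(budget, size_grid, token_grid, k_grid):
    """Return (accuracy, N, D, k) maximizing pass@k within the budget."""
    best = None
    for n in size_grid:
        for d in token_grid:
            for k in k_grid:
                if total_cost(n, d, k) <= budget:
                    acc = pass_at_k(n, d, k)
                    if best is None or acc > best[0]:
                        best = (acc, n, d, k)
    return best

if __name__ == "__main__":
    # Token grid deliberately extends into heavily overtrained ratios.
    print(best_config(budget=1e22,
                      size_grid=[1e8, 3e8, 1e9, 3e9],
                      token_grid=[2e9, 2e10, 2e11, 2e12],
                      k_grid=[1, 4, 16, 64]))
```

Even in this toy setting, charging inference against the budget pushes the optimum toward smaller, longer-trained (overtrained) models than a training-only Chinchilla analysis would pick, which is the qualitative effect the paper quantifies.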
