AI Navigate

Ranking Reasoning LLMs under Test-Time Scaling

arXiv cs.LG / 3/12/2026


Key Points

  • The paper formalizes dense benchmark ranking for test-time scaling in reasoning LLMs and introduces Scorio, an open-source library implementing several ranking methods (paired-comparison models, IRT, voting rules, graph- and spectral-based methods).
  • Evaluations across 20 reasoning models on four Olympiad-style math benchmarks show that full-trial rankings largely agree with the Bayesian gold standard Bayes_U@80 (mean Kendall's τ_b of 0.93–0.95), and 19–34 methods recover exactly the same ordering.
  • In the single-trial regime, the best methods reach Kendall's τ_b ≈ 0.86, indicating that meaningful rankings are attainable with few trials.
  • Using greedy decoding as an empirical prior (Bayes_R0@N) reduces variance at N=1 by 16–52%, but can bias rankings when greedy and stochastic sampling disagree. The work identifies reliable ranking methods for both high- and low-budget test-time scaling and releases Scorio on GitHub.
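As a quick illustration of the agreement metric used throughout these results, Kendall's τ_b compares two rankings pairwise and handles ties. The sketch below uses SciPy's `kendalltau` (whose default variant is τ_b) on hypothetical ranks for six models; the data is invented for illustration and is not from the paper.

```python
from scipy.stats import kendalltau

# Hypothetical ranks for 6 models under two ranking methods.
gold_ranks   = [1, 2, 3, 4, 5, 6]   # e.g., a gold-standard ordering
method_ranks = [1, 2, 4, 3, 5, 6]   # same ordering with one adjacent swap

# With no ties, tau_b = (concordant - discordant) / total pairs = 14-1 over 15.
tau_b, p_value = kendalltau(gold_ranks, method_ranks)
print(round(tau_b, 3))  # → 0.867
```

A single adjacent swap among six items flips 1 of the 15 possible pairs, so τ_b = 13/15 ≈ 0.867, which gives a feel for what values like 0.93–0.95 imply about near-identical orderings.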

Abstract

Test-time scaling evaluates reasoning LLMs by sampling multiple outputs per prompt, but ranking models in this regime remains underexplored. We formalize dense benchmark ranking under test-time scaling and introduce Scorio, a library that implements statistical ranking methods such as paired-comparison models, item response theory (IRT) models, voting rules, and graph- and spectral-based methods. Across 20 reasoning models on four Olympiad-style math benchmarks (AIME'24, AIME'25, HMMT'25, and BrUMO'25; up to N=80 trials), most full-trial rankings agree closely with the Bayesian gold standard Bayes_U@80 (mean Kendall's τ_b = 0.93–0.95), and 19–34 methods recover exactly the same ordering. In the single-trial regime, the best methods reach τ_b ≈ 0.86. Using greedy decoding as an empirical prior (Bayes_R0@N) reduces variance at N=1 by 16–52%, but can bias rankings when greedy and stochastic sampling disagree. These results identify reliable ranking methods for both high- and low-budget test-time scaling. We release Scorio as an open-source library at https://github.com/mohsenhariri/scorio.
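To make the paired-comparison family mentioned in the abstract concrete, here is a minimal Bradley-Terry fit via Hunter's MM updates. This is a generic sketch on invented win counts, not Scorio's API: `wins[i][j]` (how often model i beat model j) and the iteration count are assumptions for illustration.

```python
import numpy as np

# Hypothetical head-to-head win counts among 3 models:
# wins[i][j] = number of trials where model i beat model j.
wins = np.array([
    [0, 7, 8],
    [3, 0, 6],
    [2, 4, 0],
], dtype=float)

n = wins.shape[0]
strengths = np.ones(n)
for _ in range(200):
    # MM update: p_i <- W_i / sum_{j != i} (w_ij + w_ji) / (p_i + p_j),
    # where W_i is model i's total number of wins.
    new = np.empty(n)
    for i in range(n):
        denom = sum((wins[i, j] + wins[j, i]) / (strengths[i] + strengths[j])
                    for j in range(n) if j != i)
        new[i] = wins[i].sum() / denom
    strengths = new / new.sum()  # normalize: strengths are scale-invariant

ranking = np.argsort(-strengths)  # indices of models, strongest first
print(ranking)  # → [0 1 2] for this win matrix
```

The fitted strengths induce a full ranking from sparse pairwise outcomes, which is the basic mechanism paired-comparison ranking methods build on.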