AI Navigate

Ranking Reasoning LLMs under Test-Time Scaling

arXiv cs.LG / 3/12/2026


Key Points

  • The paper formalizes dense benchmark ranking for test-time scaling in reasoning LLMs and introduces Scorio, an open-source library implementing several ranking methods (paired-comparison models, IRT, voting rules, graph- and spectral-based methods).
  • Evaluations across 20 reasoning models on four Olympiad-style math benchmarks show that full-trial rankings largely agree with the Bayesian gold standard Bayes_U@80 (mean Kendall's τ_b of 0.93–0.95), and 19–34 methods recover exactly the same ordering.
  • In the single-trial regime, the best methods reach Kendall's τ_b ≈ 0.86, indicating that meaningful rankings are attainable with few trials.
  • Using greedy decoding as an empirical prior (Bayes_R0@N) reduces variance at N=1 by 16–52%, but can bias rankings when greedy and stochastic sampling disagree. The work identifies reliable ranking methods for both high- and low-budget test-time scaling and releases Scorio on GitHub.
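As a quick illustration of the agreement metric used throughout these results, Kendall's τ_b compares two rankings pairwise and handles ties. The sketch below uses SciPy's `kendalltau` (whose default variant is τ_b) on hypothetical ranks for six models; the data is invented for illustration and is not from the paper.

```python
from scipy.stats import kendalltau

# Hypothetical ranks for 6 models under two ranking methods.
gold_ranks   = [1, 2, 3, 4, 5, 6]   # e.g., a gold-standard ordering
method_ranks = [1, 2, 4, 3, 5, 6]   # same ordering with one adjacent swap

# With no ties, tau_b = (concordant - discordant) / total pairs = 14-1 over 15.
tau_b, p_value = kendalltau(gold_ranks, method_ranks)
print(round(tau_b, 3))  # → 0.867
```

A single adjacent swap among six items flips 1 of the 15 possible pairs, so τ_b = 13/15 ≈ 0.867, which gives a feel for what values like 0.93–0.95 imply about near-identical orderings.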

Abstract

Test-time scaling evaluates reasoning LLMs by sampling multiple outputs per prompt, but ranking models in this regime remains underexplored. We formalize dense benchmark ranking under test-time scaling and introduce Scorio, a library that implements statistical ranking methods such as paired-comparison models, item response theory (IRT) models, voting rules, and graph- and spectral-based methods. Across 20 reasoning models on four Olympiad-style math benchmarks (AIME'24, AIME'25, HMMT'25, and BrUMO'25; up to N=80 trials), most full-trial rankings agree closely with the Bayesian gold standard Bayes_U@80 (mean Kendall's τ_b = 0.93–0.95), and 19–34 methods recover exactly the same ordering. In the single-trial regime, the best methods reach τ_b ≈ 0.86. Using greedy decoding as an empirical prior (Bayes_R0@N) reduces variance at N=1 by 16–52%, but can bias rankings when greedy and stochastic sampling disagree. These results identify reliable ranking methods for both high- and low-budget test-time scaling. We release Scorio as an open-source library at https://github.com/mohsenhariri/scorio.
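To make the paired-comparison family mentioned in the abstract concrete, here is a minimal Bradley-Terry fit via Hunter's MM updates. This is a generic sketch on invented win counts, not Scorio's API: `wins[i][j]` (how often model i beat model j) and the iteration count are assumptions for illustration.

```python
import numpy as np

# Hypothetical head-to-head win counts among 3 models:
# wins[i][j] = number of trials where model i beat model j.
wins = np.array([
    [0, 7, 8],
    [3, 0, 6],
    [2, 4, 0],
], dtype=float)

n = wins.shape[0]
strengths = np.ones(n)
for _ in range(200):
    # MM update: p_i <- W_i / sum_{j != i} (w_ij + w_ji) / (p_i + p_j),
    # where W_i is model i's total number of wins.
    new = np.empty(n)
    for i in range(n):
        denom = sum((wins[i, j] + wins[j, i]) / (strengths[i] + strengths[j])
                    for j in range(n) if j != i)
        new[i] = wins[i].sum() / denom
    strengths = new / new.sum()  # normalize: strengths are scale-invariant

ranking = np.argsort(-strengths)  # indices of models, strongest first
print(ranking)  # → [0 1 2] for this win matrix
```

The fitted strengths induce a full ranking from sparse pairwise outcomes, which is the basic mechanism paired-comparison ranking methods build on.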