SCATR: Simple Calibrated Test-Time Ranking

arXiv cs.LG / 2026/4/21


Key points

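  • SCATR is a Best-of-N (BoN) test-time ranking method that improves test-time scaling by learning an efficient scorer rather than relying solely on confidence heuristics based on token log-probabilities.
  • It trains a lightweight scorer from a small calibration set and the base model's hidden representations, avoiding the high training and inference cost of process reward models (PRMs).
  • On coding and mathematical reasoning benchmarks, SCATR improves over prior confidence-based baselines by up to 9%.
  • Compared with LoRA fine-tuning on the same calibration data, SCATR achieves comparable accuracy with up to 8000x fewer trainable parameters, and reduces training and inference latency by up to 150x and 1000x, respectively.
  • SCATR is also competitive with strong PRM baselines; in some settings it improves accuracy by up to 7.8% on math and 4.2% on coding while running inference up to 1000x faster.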
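
For orientation, the kind of confidence-based baseline that the key points compare against can be sketched as Best-of-N selection by mean token log-probability. This is a minimal illustration, not the paper's code: the function names `mean_logprob` and `best_of_n`, the length normalization, and the example candidates and log-probabilities are all assumptions made for this sketch.

```python
from typing import List, Sequence


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalized confidence: average log-probability per token."""
    if not token_logprobs:
        return float("-inf")  # an empty response gets the lowest possible score
    return sum(token_logprobs) / len(token_logprobs)


def best_of_n(candidates: List[str], logprobs: List[Sequence[float]]) -> str:
    """Return the candidate whose tokens have the highest mean log-probability."""
    if len(candidates) != len(logprobs):
        raise ValueError("need one log-probability sequence per candidate")
    scores = [mean_logprob(lp) for lp in logprobs]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]


# Hypothetical candidates and per-token log-probabilities (illustrative only).
candidates = ["x = 4", "x = 5", "x = 4.0"]
logprobs = [
    [-0.2, -0.1, -0.3],
    [-1.2, -0.9, -2.0],
    [-0.4, -0.5, -0.2, -0.6],
]
selected = best_of_n(candidates, logprobs)
```

The heuristic needs no training, which is why it is cheap; its weakness is that the model's own token confidence is the only signal used to rank responses.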

Abstract

Test-time scaling (TTS) improves large language models (LLMs) by allocating additional compute at inference time. In practice, TTS is often achieved through parallel scaling: generating multiple candidate responses and selecting the best via a Best-of-N (BoN) strategy. Its effectiveness therefore hinges on the scoring function. Learned scorers such as process reward models (PRMs) can be strong, but they are expensive to train and run. Lightweight confidence heuristics based on token log-probabilities are much cheaper, yet we find that they often perform substantially worse. To improve on lightweight confidence heuristics without incurring the full cost of stronger learned scorers, we introduce SCATR, a simple and efficient BoN ranking method that learns a lightweight scorer from a small calibration set using hidden representations from the base model. Across coding and mathematical reasoning benchmarks, SCATR improves over prior confidence-based baselines by up to 9%. Relative to LoRA fine-tuning on the same calibration data, it achieves comparable accuracy with up to 8000x fewer trainable parameters and much lower compute, reducing training and inference latency by up to 150x and 1000x, respectively. SCATR is also competitive with strong PRM baselines, and in several settings improves accuracy by up to 7.8% on math and 4.2% on coding while enabling up to 1000x faster inference. Overall, SCATR offers a strong accuracy-efficiency trade-off for scalable test-time selection.
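
The idea of learning a lightweight scorer on hidden representations can be illustrated with a small probe. The sketch below assumes the scorer is a logistic-regression probe over one pooled hidden vector per response; the abstract does not specify the architecture, so this choice, the helper names `train_scorer` and `score`, and the synthetic 8-dimensional "hidden states" are all assumptions for illustration. In practice the feature vectors would come from the base model rather than a random generator.

```python
import math
import random
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w: Sequence[float], b: float, h: Sequence[float]) -> float:
    """Estimated probability that the response with hidden vector h is correct."""
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)


def train_scorer(
    features: List[List[float]],
    labels: List[int],
    epochs: int = 200,
    lr: float = 0.5,
) -> Tuple[List[float], float]:
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for h, y in zip(features, labels):
            err = score(w, b, h) - y  # gradient of log-loss w.r.t. the logit
            for j in range(dim):
                grad_w[j] += err * h[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Synthetic calibration set (assumption): correct responses shift dimension 0 upward.
rng = random.Random(0)
DIM = 8
calib_labels = [i % 2 for i in range(64)]
calib_features = [
    [(1.0 if y else -1.0) + rng.gauss(0, 0.5)]
    + [rng.gauss(0, 0.5) for _ in range(DIM - 1)]
    for y in calib_labels
]
w, b = train_scorer(calib_features, calib_labels)

# Best-of-N at test time: score each candidate's hidden vector, keep the argmax.
candidate_features = [
    [-2.0] + [0.0] * (DIM - 1),
    [2.0] + [0.0] * (DIM - 1),
    [-1.0] + [0.0] * (DIM - 1),
]
candidate_scores = [score(w, b, h) for h in candidate_features]
best_index = max(range(len(candidate_scores)), key=candidate_scores.__getitem__)
```

A probe like this has only `DIM + 1` trainable parameters and scores a response with a single dot product, which gives a sense of why such a scorer can be far cheaper to train and run than a PRM or a LoRA-tuned model.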