SCATR: Simple Calibrated Test-Time Ranking

arXiv cs.LG · April 21, 2026


Key Points

  • SCATR is a new Best-of-N (BoN) ranking method for test-time scaling of LLMs that learns an efficient scorer instead of relying on token log-probability confidence heuristics.
  • The approach trains a lightweight scorer using a small calibration set and hidden representations from the base model, avoiding the high training and inference cost of learned process reward models (PRMs).
  • On coding and mathematical reasoning benchmarks, SCATR improves over existing confidence-based baselines by up to 9%.
  • Compared with LoRA fine-tuning on the same calibration data, SCATR achieves comparable accuracy while requiring up to 8000× fewer trainable parameters and reducing training and inference latency by up to 150× and 1000×, respectively.
  • SCATR is competitive with strong PRM baselines; in several settings it improves accuracy by up to 7.8% on math and 4.2% on coding while enabling up to 1000× faster inference.

Abstract

Test-time scaling (TTS) improves large language models (LLMs) by allocating additional compute at inference time. In practice, TTS is often achieved through parallel scaling: generating multiple candidate responses and selecting the best via a Best-of-N (BoN) strategy. Its effectiveness therefore hinges on the scoring function. Learned scorers such as process reward models (PRMs) can be strong, but they are expensive to train and run. Lightweight confidence heuristics based on token log-probabilities are much cheaper, yet we find that they often perform substantially worse. To improve on lightweight confidence heuristics without incurring the full cost of stronger learned scorers, we introduce SCATR, a simple and efficient BoN ranking method that learns a lightweight scorer from a small calibration set using hidden representations from the base model. Across coding and mathematical reasoning benchmarks, SCATR improves over prior confidence-based baselines by up to 9%. Relative to LoRA fine-tuning on the same calibration data, it achieves comparable accuracy with up to 8000× fewer trainable parameters and much lower compute, reducing training and inference latency by up to 150× and 1000×, respectively. SCATR is also competitive with strong PRM baselines, and in several settings improves accuracy by up to 7.8% on math and 4.2% on coding while enabling up to 1000× faster inference. Overall, SCATR offers a strong accuracy-efficiency trade-off for scalable test-time selection.
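As a rough illustration of the pipeline the abstract describes — train a lightweight scorer on hidden representations from a small calibration set, then use it to rank N candidate responses — the sketch below simulates the setup end to end. This is not the paper's implementation: the hidden states are stand-in random vectors, and the logistic-regression scorer, dimensions, and training loop are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 16        # hidden dimension (real LLMs: thousands)
N_CALIB = 200 # small calibration set, as in the paper's setting
N = 8         # number of candidate responses at test time

# Latent direction separating good from bad answers (simulation only).
w_true = rng.normal(size=D)

def hidden_states(n):
    """Stand-in for the base model's hidden representation of a response."""
    return rng.normal(size=(n, D))

# --- Calibration: hidden states paired with correctness labels ---
X = hidden_states(N_CALIB)
y = (X @ w_true + 0.5 * rng.normal(size=N_CALIB) > 0).astype(float)

# --- Train a lightweight scorer (logistic regression, D+1 parameters) ---
w, b, lr = np.zeros(D), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(correct)
    w -= lr * (X.T @ (p - y)) / N_CALIB     # gradient step on log-loss
    b -= lr * float(np.mean(p - y))

# --- Best-of-N selection: score all candidates, keep the argmax ---
def best_of_n(cand_hidden):
    scores = cand_hidden @ w + b
    return int(np.argmax(scores)), scores

best_idx, scores = best_of_n(hidden_states(N))
print("selected candidate:", best_idx)
```

The point of the sketch is the parameter count: the scorer here has D + 1 trainable parameters and scoring a candidate is a single dot product, which is the kind of gap behind the "up to 8000× fewer parameters" and latency comparisons against LoRA fine-tuning and PRMs.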