PiCSAR: Probabilistic Confidence Selection And Ranking for Reasoning Chains

arXiv cs.CL / 5/1/2026


Key Points

  • The PiCSAR method improves large language/reasoning models by applying best-of-n sampling and then ranking candidates using a confidence-based scoring function.
  • PiCSAR is training-free and scores each candidate by the joint log-likelihood of the reasoning and the final answer, which splits naturally into reasoning confidence and answer confidence.
  • Experiments on multiple reasoning benchmarks show substantial accuracy gains, including +10.18 on MATH500 and +9.81 on AIME2025.
  • Compared with baselines, PiCSAR achieves better results while using at least 2x fewer samples in 16 of 20 comparisons, indicating improved sample efficiency.
  • The authors’ analysis supports the approach by finding that correct reasoning chains consistently have significantly higher reasoning and answer confidence than incorrect ones.
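The selection step described in the points above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Candidate` class, the function names, and the assumption that per-token log-probabilities are already available for the reasoning and answer spans are all hypothetical. It also sums raw log-probabilities with no length normalization, which the summary does not specify either way.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Candidate:
    """One sampled generation (hypothetical container, not from the paper)."""
    text: str
    reasoning_logprobs: Sequence[float]  # per-token log-probs of the reasoning chain
    answer_logprobs: Sequence[float]     # per-token log-probs of the final answer


def reasoning_confidence(c: Candidate) -> float:
    # Log-likelihood of the reasoning tokens given the prompt.
    return sum(c.reasoning_logprobs)


def answer_confidence(c: Candidate) -> float:
    # Log-likelihood of the answer tokens given the prompt and the reasoning.
    return sum(c.answer_logprobs)


def joint_score(c: Candidate) -> float:
    # Joint log-likelihood = reasoning confidence + answer confidence.
    return reasoning_confidence(c) + answer_confidence(c)


def select_best(candidates: List[Candidate]) -> Candidate:
    # Best-of-n: keep the candidate with the highest joint log-likelihood.
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=joint_score)


# Toy usage with made-up log-probabilities.
samples = [
    Candidate("A", reasoning_logprobs=[-0.1, -0.2], answer_logprobs=[-0.05]),
    Candidate("B", reasoning_logprobs=[-1.0, -0.5], answer_logprobs=[-0.01]),
]
best = select_best(samples)
```

Because the score needs only token log-probabilities from the model that produced the samples, no reward model or extra training is involved, which matches the training-free claim.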

Abstract

Best-of-n sampling improves the accuracy of large language models (LLMs) and large reasoning models (LRMs) by generating multiple candidate solutions and selecting the one with the highest reward. The key challenge for reasoning tasks is designing a scoring function that can identify correct reasoning chains without access to ground-truth answers. We propose Probabilistic Confidence Selection And Ranking (PiCSAR): a simple, training-free method that scores each candidate generation using the joint log-likelihood of the reasoning and final answer. This joint log-likelihood naturally decomposes into reasoning confidence and answer confidence. PiCSAR achieves substantial gains across diverse benchmarks (+10.18 on MATH500, +9.81 on AIME2025), outperforming baselines with at least 2x fewer samples in 16 out of 20 comparisons. Our analysis reveals that correct reasoning chains exhibit significantly higher reasoning and answer confidence, justifying the effectiveness of PiCSAR.
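The decomposition mentioned in the abstract follows from the chain rule of probability. The notation here is an assumption, not taken from the abstract: x is the prompt, r the reasoning chain, y the final answer, and n the number of sampled candidates.

```latex
\log p(r, y \mid x)
  = \underbrace{\log p(r \mid x)}_{\text{reasoning confidence}}
  + \underbrace{\log p(y \mid r, x)}_{\text{answer confidence}},
\qquad
\hat{\imath} = \arg\max_{i \in \{1, \dots, n\}} \log p(r_i, y_i \mid x)
```

Under this reading, best-of-n selection returns the candidate (r_i, y_i) with the largest joint log-likelihood.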