PiCSAR: Probabilistic Confidence Selection And Ranking for Reasoning Chains
arXiv cs.CL / 5/1/2026
Key Points
- PiCSAR improves the accuracy of large language and reasoning models under best-of-n sampling: it draws n candidate reasoning chains and ranks them with a confidence-based scoring function.
- PiCSAR is training-free and scores each candidate by the joint log-likelihood of the reasoning and the final answer, which splits naturally into reasoning confidence and answer confidence.
- Experiments on multiple reasoning benchmarks show substantial accuracy gains, including +10.18 on MATH500 and +9.81 on AIME2025.
- Compared with baselines, PiCSAR achieves better results while using at least 2x fewer samples in 16 of 20 comparisons, indicating improved sample efficiency.
- The authors’ analysis supports the approach by finding that correct reasoning chains consistently have significantly higher reasoning and answer confidence than incorrect ones.
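
The scoring rule described in the key points can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes per-token log-probabilities are already available for each sampled candidate, and the `Candidate` layout and function names are hypothetical. It uses the plain sum of token log-probabilities; any normalization the paper may apply is not shown.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Candidate:
    """One sampled chain (hypothetical layout, assumed for illustration)."""
    answer: str
    reasoning_logprobs: Sequence[float]  # log-prob of each reasoning token
    answer_logprobs: Sequence[float]     # log-prob of each final-answer token


def reasoning_confidence(c: Candidate) -> float:
    # Log-likelihood of the reasoning given the prompt: sum of token log-probs.
    return sum(c.reasoning_logprobs)


def answer_confidence(c: Candidate) -> float:
    # Log-likelihood of the final answer given the prompt and the reasoning.
    return sum(c.answer_logprobs)


def joint_score(c: Candidate) -> float:
    # The joint log-likelihood splits into the two confidence terms.
    return reasoning_confidence(c) + answer_confidence(c)


def select_best(candidates: Sequence[Candidate]) -> Candidate:
    # Best-of-n selection: keep the candidate with the highest joint score.
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=joint_score)


# Toy log-probabilities, invented purely to exercise the functions.
samples = [
    Candidate("42", [-0.5, -0.25], [-0.25]),
    Candidate("41", [-1.0, -2.0], [-0.5]),
]
best = select_best(samples)
print(best.answer, joint_score(best))
```

Because the score is computed from log-probabilities the model already produces while sampling, a ranking of this kind needs no extra training, which matches the training-free claim above.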