Beyond Majority Voting: Efficient Best-Of-N with Radial Consensus Score

arXiv cs.CL / 4/15/2026


Key Points

  • The paper proposes Radial Consensus Score (RCS), a training-free method for best-of-N response selection in LLMs that goes beyond simple majority voting.
  • RCS embeds candidate answers, computes a weighted semantic center via a (weighted) Fréchet mean, and ranks candidates by their radial distance to that center to model semantic consensus.
  • The method supports multiple weighting schemes (uniform, frequency-based, probability-based), allowing it to incorporate agreement signals and model confidence even in black-box settings.
  • Experiments on seven QA/reasoning benchmarks using five open-weight models show RCS consistently outperforms strong baselines, with larger improvements as the sampling budget increases.
  • RCS also works as a drop-in replacement for majority voting in multi-agent debate and demonstrates robustness in black-box scenarios, suggesting geometric consensus as a scalable aggregation principle.
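To make the core idea concrete, here is a minimal sketch of the radial-consensus computation. This is an illustrative reconstruction, not the paper's code: it assumes answer embeddings are plain Euclidean vectors, in which case the weighted Fréchet mean reduces to the weighted average, and uses uniform weights by default (the frequency- and probability-based variants would just change the `weights` argument).

```python
def radial_consensus_score(embeddings, weights=None):
    """Rank candidate answers by distance to a weighted semantic center.

    embeddings: list of N equal-length vectors (one per candidate answer).
    weights: optional per-candidate weights (uniform if omitted).
    Returns candidate indices sorted best-first (closest to the center).
    """
    n = len(embeddings)
    d = len(embeddings[0])
    if weights is None:
        weights = [1.0 / n] * n          # uniform weighting variant
    total = sum(weights)
    weights = [w / total for w in weights]
    # In Euclidean space the weighted Frechet mean is the weighted average.
    center = [sum(w * e[j] for w, e in zip(weights, embeddings))
              for j in range(d)]
    # Radial distance of each candidate to the semantic center.
    dists = [sum((e[j] - center[j]) ** 2 for j in range(d)) ** 0.5
             for e in embeddings]
    return sorted(range(n), key=lambda i: dists[i])

# Toy example: three near-duplicate answers and one semantic outlier.
emb = [[1.0, 0.0], [0.9, 0.1], [1.1, -0.1], [-1.0, 0.0]]
order = radial_consensus_score(emb)  # → [1, 0, 2, 3]; outlier ranked last
```

In this toy case the three mutually close embeddings pull the center toward themselves, so one of them is selected and the outlier is ranked last — the geometric analogue of majority agreement, but computed on continuous embeddings rather than discrete votes.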

Abstract

Large language models (LLMs) frequently generate multiple candidate responses for a given prompt, yet selecting the most reliable one remains challenging, especially when correctness diverges from surface-level majority agreement. Existing approaches, such as self-consistency, rely on discrete voting, while probability-based methods often fail to capture relationships among candidate answers or tend to underweight high-quality but less frequent responses, and do not fully leverage the geometric structure of answer representations. To address these limitations, we introduce Radial Consensus Score (RCS), a simple, efficient, and training-free method for best-of-N selection. RCS models semantic consensus by computing a weighted Fréchet mean (semantic center) of answer embeddings and ranking candidates by their radial distance to this center. Importantly, RCS provides a general framework that supports multiple weighting schemes, including uniform, frequency-based, and probability-based variants, enabling flexible integration of agreement signals and model confidence while remaining fully applicable in black-box settings. Extensive experiments across seven benchmarks covering short-form QA and long-form reasoning tasks, and five open-weight models, demonstrate that RCS variants consistently outperform strong baselines, with gains becoming more pronounced as the sampling budget increases. RCS also serves as an effective drop-in replacement for majority voting in multi-agent debate and exhibits strong robustness in black-box scenarios. Overall, these results highlight geometric consensus as a scalable and broadly applicable principle for reliable answer selection, extending beyond majority voting to more expressive and robust aggregation in LLM inference.
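In standard notation (the symbols below are illustrative, not necessarily the paper's), the weighted Fréchet mean and the resulting score can be written as follows: the semantic center minimizes the weighted sum of squared distances to the candidate embeddings, and each candidate is scored by its (negated) radial distance to that center.

```latex
c^{*} = \arg\min_{c} \sum_{i=1}^{N} w_i \, d(e_i, c)^2,
\qquad
\mathrm{RCS}(a_i) = -\, d(e_i, c^{*})
```

Here $e_i$ is the embedding of candidate answer $a_i$, $w_i$ its weight under the chosen scheme (uniform, frequency-based, or probability-based), and the selected answer is the one with the highest score, i.e. the smallest distance to $c^{*}$. When $d$ is the Euclidean distance, $c^{*}$ is simply the weighted average $\sum_i w_i e_i / \sum_i w_i$.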