Permutation-Consensus Listwise Judging for Robust Factuality Evaluation
arXiv cs.CL / 3/24/2026
Key Points
- The paper identifies candidate-order sensitivity as a key instability in listwise factuality evaluation when LLMs are used as judges to rank multiple answers.
- It proposes PCFJudge, an inference-time method that reruns the same listwise factuality-first prompt over multiple permutations of the candidate set and aggregates scores, rankings, and uncertainty into a consensus.
- Experiments on RewardBench 2 Factuality show that PCFJudge improves on direct judging by up to 7 absolute points.
- Ablation studies indicate that most of the benefit comes from permutation consensus itself rather than adding more complex arbitration mechanisms.
- The authors conclude that order-induced variance is a meaningful contributor to factuality-judging error and that averaging over nuisance presentation changes can make LLM evaluations more reliable.
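The permutation-consensus idea in the points above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Judge` interface, the function name `permutation_consensus`, the `max_permutations` parameter, and the choice of mean score, mean rank, and score standard deviation as the aggregates are all assumptions made for illustration. The toy judge stands in for an LLM call and simply adds a bonus to whichever candidate is shown first, mimicking order sensitivity.

```python
import itertools
import random
import statistics
from typing import Callable, Dict, List, Optional, Sequence

# Hypothetical judge interface (an assumption, not from the paper): it receives
# the candidates in presentation order and returns one factuality score per
# candidate, in that same order.
Judge = Callable[[Sequence[str]], List[float]]


def permutation_consensus(
    candidates: Sequence[str],
    judge: Judge,
    max_permutations: Optional[int] = None,
    seed: int = 0,
) -> List[Dict[str, object]]:
    """Run the judge over several orderings and aggregate per candidate."""
    n = len(candidates)
    orders = list(itertools.permutations(range(n)))
    # Optionally subsample orderings when the full set is too large.
    if max_permutations is not None and len(orders) > max_permutations:
        orders = random.Random(seed).sample(orders, max_permutations)

    scores: Dict[int, List[float]] = {i: [] for i in range(n)}
    ranks: Dict[int, List[int]] = {i: [] for i in range(n)}
    for order in orders:
        run_scores = judge([candidates[i] for i in order])
        # Rank positions within this run (1 = best), then map each position
        # back to the candidate's original index.
        by_score = sorted(range(n), key=lambda pos: -run_scores[pos])
        for rank, pos in enumerate(by_score, start=1):
            ranks[order[pos]].append(rank)
        for pos, idx in enumerate(order):
            scores[idx].append(run_scores[pos])

    summary = [
        {
            "candidate": candidates[i],
            "mean_score": statistics.fmean(scores[i]),
            "mean_rank": statistics.fmean(ranks[i]),
            # Spread across orderings, used here as a simple uncertainty proxy.
            "score_std": statistics.pstdev(scores[i]),
        }
        for i in range(n)
    ]
    summary.sort(key=lambda row: (-row["mean_score"], row["mean_rank"]))
    return summary


def toy_biased_judge(presented: Sequence[str]) -> List[float]:
    """Toy stand-in for an LLM judge: fixed quality plus a first-slot bonus."""
    quality = {"A": 0.6, "B": 0.5, "C": 0.2}
    return [quality[c] + (0.3 if pos == 0 else 0.0) for pos, c in enumerate(presented)]


if __name__ == "__main__":
    for row in permutation_consensus(["A", "B", "C"], toy_biased_judge):
        print(row)
```

In this toy setup, a single pass with the order B, A, C scores B above A because of the first-slot bonus, while the consensus over all six orderings ranks A first, since every candidate receives the bonus equally often. A nonzero `score_std` flags candidates whose score moved with presentation order.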