Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim Verification

arXiv cs.CL · March 31, 2026


Key Points

  • The paper introduces PROClaim, a courtroom-style structured multi-agent debate framework designed to improve high-stakes claim verification where LLMs are prone to hallucinations and shallow reasoning.
  • It combines Progressive RAG (P-RAG) with role-switching agents (e.g., Plaintiff, Defense, Judge) so the system can iteratively expand and refine the evidence pool during deliberation rather than relying on a single retrieval pass.
  • The method adds evidence negotiation, self-reflection, and heterogeneous multi-judge aggregation to improve calibration, robustness, and diversity of judgments.
  • In zero-shot experiments on the Check-COVID benchmark, PROClaim reaches 81.7% accuracy, improving over standard multi-agent debate by 10.0 percentage points, with P-RAG accounting for most of the gain (+7.5 pp).
  • The authors report that structural deliberation and model heterogeneity help mitigate systematic biases and provide a more reliable foundation for claim verification systems, with code and data released publicly.
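The core loop described above, where the evidence pool is expanded between debate rounds rather than fixed after one retrieval pass, can be sketched as follows. This is an illustrative toy sketch, not the authors' released implementation: the corpus, the keyword-based `retrieve` stub, and the `debate_round` placeholder (standing in for actual Plaintiff/Defense LLM agents) are all hypothetical.

```python
# Toy sketch of a courtroom-style debate loop with Progressive RAG (P-RAG):
# each round issues a new query and grows the shared evidence pool, instead
# of relying on a single up-front retrieval pass. All names are hypothetical.

TOY_CORPUS = {
    "masks": ["Masks reduce droplet spread.", "Mask mandates varied by region."],
    "vaccines": ["Vaccines reduce severe outcomes.", "Trial data were peer reviewed."],
}

def retrieve(query: str, pool: list) -> list:
    """One retrieval pass: add corpus snippets for the query not already in the pool."""
    new = [s for s in TOY_CORPUS.get(query, []) if s not in pool]
    return pool + new

def debate_round(claim: str, evidence: list, round_no: int) -> str:
    """Placeholder for Plaintiff/Defense arguments over the current evidence."""
    return f"Round {round_no}: {len(evidence)} evidence items debated for '{claim}'"

def run_debate(claim: str, queries: list) -> tuple:
    """Run one debate round per query, expanding the evidence pool each time."""
    evidence, transcript = [], []
    for round_no, query in enumerate(queries, start=1):
        evidence = retrieve(query, evidence)        # P-RAG: refine/expand the pool
        transcript.append(debate_round(claim, evidence, round_no))
    return evidence, transcript
```

In the paper's framing, `debate_round` would invoke adversarial role-playing agents whose arguments also drive the next retrieval query; here the queries are supplied up front purely to keep the sketch runnable.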

Abstract

Large language models (LLMs) remain unreliable for high-stakes claim verification due to hallucinations and shallow reasoning. While retrieval-augmented generation (RAG) and multi-agent debate (MAD) address this, they are limited by one-pass retrieval and unstructured debate dynamics. We propose a courtroom-style multi-agent framework, PROClaim, that reformulates verification as a structured, adversarial deliberation. Our approach integrates specialized roles (e.g., Plaintiff, Defense, Judge) with Progressive RAG (P-RAG) to dynamically expand and refine the evidence pool during the debate. Furthermore, we employ evidence negotiation, self-reflection, and heterogeneous multi-judge aggregation to enforce calibration, robustness, and diversity. In zero-shot evaluations on the Check-COVID benchmark, PROClaim achieves 81.7% accuracy, outperforming standard multi-agent debate by 10.0 percentage points, with P-RAG driving the primary performance gains (+7.5 pp). We ultimately demonstrate that structural deliberation and model heterogeneity effectively mitigate systematic biases, providing a robust foundation for reliable claim verification. Our code and data are publicly available at https://github.com/mnc13/PROClaim.
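The heterogeneous multi-judge aggregation mentioned in the abstract can be pictured as a vote over verdicts from judges backed by different models. The sketch below is a guess at one simple aggregation rule (majority vote with a "not enough info" fallback on ties); the paper may use a different scheme, and the label set is assumed, not taken from the source.

```python
# Illustrative multi-judge aggregation (not the authors' exact rule):
# majority vote over verdicts from heterogeneous judge models, falling
# back to "NEI" (not enough info) when the top verdicts tie.
from collections import Counter

def aggregate_verdicts(verdicts: list) -> str:
    """Return the majority verdict, or 'NEI' if no single verdict leads."""
    counts = Counter(verdicts)
    top_verdict, top_count = counts.most_common(1)[0]
    if sum(1 for c in counts.values() if c == top_count) > 1:
        return "NEI"  # tie between judges: abstain rather than pick arbitrarily
    return top_verdict
```

Using judges backed by different model families (the "heterogeneity" the paper credits with bias mitigation) means their errors are less likely to be correlated, which is what makes a simple vote like this informative.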