DiscoUQ: Structured Disagreement Analysis for Uncertainty Quantification in LLM Agent Ensembles

arXiv cs.CL · March 24, 2026


Key Points

  • The paper introduces DiscoUQ, a framework for quantifying uncertainty in LLM agent ensembles by modeling structured inter-agent disagreement rather than relying on shallow vote statistics.
  • DiscoUQ extracts semantic disagreement signals from agents’ reasoning (e.g., evidence overlap, argument strength, divergence depth) and augments them with embedding-geometry features (e.g., cluster distances and dispersion).
  • It presents three progressively complex variants—DiscoUQ-LLM, DiscoUQ-Embed, and DiscoUQ-Learn—that use logistic regression and a neural network to produce calibrated confidence estimates.
  • On four benchmarks (StrategyQA, MMLU, TruthfulQA, ARC-Challenge) using a 5-agent setup with Qwen3.5-27B, DiscoUQ-LLM improves AUROC to 0.802 versus 0.791 for the best baseline while achieving better calibration (ECE 0.036 vs. 0.098).
  • The approach shows strong cross-benchmark generalization and delivers the biggest gains in ambiguous cases where agents exhibit “weak disagreement” and vote counting underperforms.
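To make the embedding-geometry side of this concrete, here is a minimal sketch of how disagreement signals like dispersion and cohesion could be computed from per-agent answer embeddings. The exact formulas below are illustrative assumptions, not the authors' implementation; the feature names follow the paper's description.

```python
import numpy as np

def disagreement_features(embeddings: np.ndarray) -> dict:
    """Compute simple embedding-geometry disagreement signals for one question.

    `embeddings` has shape (n_agents, dim): one vector per agent's
    answer or reasoning trace. The formulas here are a plausible
    reading of "dispersion" and "cohesion", not the paper's own.
    """
    # Dispersion: mean distance of agents from the ensemble centroid.
    centroid = embeddings.mean(axis=0)
    dispersion = float(np.linalg.norm(embeddings - centroid, axis=1).mean())

    # Cohesion: mean pairwise cosine similarity between agents
    # (diagonal self-similarities excluded).
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(embeddings)
    cohesion = float((sims.sum() - n) / (n * (n - 1)))

    return {"dispersion": dispersion, "cohesion": cohesion}

# Five agents, 8-dimensional embeddings: low dispersion / high cohesion
# would indicate agents converging on semantically similar answers.
rng = np.random.default_rng(0)
feats = disagreement_features(rng.normal(size=(5, 8)))
```

In a DiscoUQ-Embed-style pipeline, features like these would then be fed to a logistic regression trained to predict whether the ensemble's majority answer is correct.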

Abstract

Multi-agent LLM systems, where multiple prompted instances of a language model independently answer questions, are increasingly used for complex reasoning tasks. However, existing methods for quantifying the uncertainty of their collective outputs rely on shallow voting statistics that discard the rich semantic information in agents' reasoning. We introduce DiscoUQ, a framework that extracts and leverages the structure of inter-agent disagreement -- both linguistic properties (evidence overlap, argument strength, divergence depth) and embedding geometry (cluster distances, dispersion, cohesion) -- to produce well-calibrated confidence estimates. We propose three methods of increasing complexity: DiscoUQ-LLM (logistic regression on LLM-extracted structure features), DiscoUQ-Embed (logistic regression on embedding geometry), and DiscoUQ-Learn (a neural network combining all features). Evaluated on four diverse benchmarks (StrategyQA, MMLU, TruthfulQA, ARC-Challenge) with a 5-agent system using Qwen3.5-27B, DiscoUQ-LLM achieves an average AUROC of 0.802, outperforming the best baseline (LLM Aggregator, 0.791) while being substantially better calibrated (ECE 0.036 vs. 0.098). The learned features generalize across benchmarks with near-zero performance degradation and provide the largest improvements where they are most needed: in the ambiguous "weak disagreement" tier where simple vote counting fails.
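The abstract's calibration comparison (ECE 0.036 vs. 0.098) uses expected calibration error. For readers unfamiliar with the metric, here is the conventional equal-width-bin definition: predictions are binned by confidence, and ECE is the bin-size-weighted average gap between each bin's accuracy and its mean confidence. The binning choices below are the standard ones; the paper's exact settings may differ.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean |accuracy - confidence| per bin.

    `confidences` are predicted probabilities in [0, 1]; `correct` is a
    0/1 indicator of whether each prediction was right. Lower is better:
    a perfectly calibrated predictor scores 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # First bin is closed on the left so confidence 0.0 is included.
        if i == 0:
            mask = (confidences >= lo) & (confidences <= hi)
        else:
            mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return float(ece)
```

For example, a model that always says 0.9 but is never right has an ECE of 0.9, while one whose per-bin accuracy matches its stated confidence has an ECE of 0.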