GSAR: Typed Grounding for Hallucination Detection and Recovery in Multi-Agent LLMs

arXiv cs.AI / 4/28/2026


Key Points

  • The paper introduces GSAR, a new grounding-evaluation and replanning framework for multi-agent LLM systems that generates structured diagnostic reports during incident investigation.
  • GSAR improves hallucination handling by categorizing claims into grounded/ungrounded/contradicted/complementary and explicitly weighting evidence by its epistemic strength.
  • It computes an asymmetric, contradiction-penalized weighted groundedness score and maps that score to tiered decisions (proceed, regenerate, replan) within a bounded-iteration outer loop under a fixed compute budget.
  • The authors formalize the algorithm, prove six structural properties, and report consistent evaluation gains across five design claims using FEVER with gold Wikipedia evidence and four independently trained LLM judges.
  • The paper includes comparisons against a Vectara groundedness-oriented baseline and claims GSAR is the first published framework to combine evidence-typed scoring with tiered recovery under explicit compute constraints.
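The scoring-and-decision mechanism in the points above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the evidence-type weights, the partial credit given to complementary claims, the contradiction multiplier, and the decision thresholds are all assumed values chosen for the example.

```python
# Hypothetical sketch of GSAR-style scoring. All numeric constants below
# (weights, partial credit, thresholds) are illustrative assumptions.

# Evidence-type weights reflecting epistemic strength (assumed values).
EVIDENCE_WEIGHTS = {"log": 1.0, "metric": 0.9, "trace": 0.8, "inference": 0.3}

def groundedness_score(claims, rho=2.0):
    """Asymmetric, contradiction-penalized weighted score (higher is better;
    can go negative when contradictions dominate).

    `claims` is a list of (claim_type, evidence_type) pairs, where claim_type
    is one of "grounded", "ungrounded", "contradicted", "complementary".
    `rho` is the contradiction penalty multiplier; rho=0 removes the asymmetry.
    """
    total = 0.0
    weight_sum = 0.0
    for claim_type, evidence_type in claims:
        w = EVIDENCE_WEIGHTS.get(evidence_type, 0.5)
        weight_sum += w
        if claim_type == "grounded":
            total += w
        elif claim_type == "complementary":
            total += 0.5 * w   # partial credit for non-redundant perspectives
        elif claim_type == "contradicted":
            total -= rho * w   # penalized harder than mere absence of support
        # "ungrounded" adds no support but still counts in the denominator
    return total / weight_sum if weight_sum else 0.0

def decide(score, proceed_at=0.8, regenerate_at=0.5):
    """Map the score to the three-tier decision (thresholds are assumptions)."""
    if score >= proceed_at:
        return "proceed"
    if score >= regenerate_at:
        return "regenerate"
    return "replan"
```

Note how the asymmetry works: a contradicted claim backed by a strong evidence type (e.g. a log) subtracts `rho` times its weight, so with the default `rho=2.0` a single contradiction can pull an otherwise well-supported report below the replan threshold, whereas with `rho=0` it merely dilutes the score like an ungrounded claim.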

Abstract

Autonomous multi-agent LLM systems are increasingly deployed to investigate operational incidents and produce structured diagnostic reports. Their trustworthiness hinges on whether each claim is grounded in observed evidence rather than model-internal inference. Existing groundedness evaluators (binary classifiers, LLM-as-judge scalars, self-correction loops) treat supporting evidence as interchangeable and emit a single signal that offers no principled control over downstream action. We present GSAR, a grounding-evaluation and replanning framework that (i) partitions claims into a four-way typology (grounded, ungrounded, contradicted, complementary), giving first-class standing to non-redundant alternative perspectives; (ii) assigns evidence-type-specific weights reflecting epistemic strength; (iii) computes an asymmetric contradiction-penalised weighted groundedness score; and (iv) couples that score to a three-tier decision function (proceed, regenerate, replan) driving a bounded-iteration outer loop under an explicit compute budget. We formalise the algorithm, prove six structural properties, and evaluate five design claims on FEVER with gold Wikipedia evidence under four independently trained LLM judges (gpt-5.4, claude-sonnet-4-6, claude-opus-4-7, gemini-2.5-pro). Every ablation reproduces in the same direction on every judge: bootstrap 95% CIs on the ρ=0 effect exclude 0 on all four; the no-complementary ablation under Opus 4.7 has CI [−96, −68] of 200; at n=1000 three independent judges converge to ΔS(ρ=0)=+0.058. A head-to-head against Vectara HHEM-2.1-Open is included. To our knowledge, GSAR is the first published groundedness framework coupling evidence-typed scoring with tiered recovery under an explicit compute budget.
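The bounded-iteration outer loop the abstract describes can be sketched as follows. This is a minimal illustration under stated assumptions: `generate`, `evaluate`, and `replan` are placeholder callables standing in for the paper's agents and evaluator, and the iteration cap and budget values are arbitrary.

```python
# Illustrative outer loop: bounded iterations under an explicit compute
# budget, driven by a tiered decision. The function names and budget
# accounting are placeholders, not the paper's API.

def run_with_budget(generate, evaluate, replan, max_iters=3, budget=10.0):
    """Regenerate or replan until the evaluator says "proceed", the
    iteration cap is reached, or the compute budget is exhausted.

    `generate(plan)` returns (report, cost); `evaluate(report)` returns
    (score, decision) with decision in {"proceed", "regenerate", "replan"};
    `replan(plan)` returns a revised plan.
    """
    plan = "initial-plan"
    spent = 0.0
    report = None
    for _ in range(max_iters):
        report, cost = generate(plan)
        spent += cost
        _score, decision = evaluate(report)
        if decision == "proceed" or spent >= budget:
            break
        if decision == "replan":
            plan = replan(plan)  # revise the investigation plan itself
        # on "regenerate": keep the plan and retry generation
    return report, spent
```

The key design point the abstract emphasises is that recovery is tiered: a mid-range score triggers cheap regeneration with the same plan, while a low score triggers the more expensive replanning step, and both are capped by `max_iters` and `budget` so the loop terminates under a fixed compute allowance.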