GSAR: Typed Grounding for Hallucination Detection and Recovery in Multi-Agent LLMs

arXiv cs.AI / 4/28/2026


Key Points

  • The paper introduces GSAR, a new grounding-evaluation and replanning framework for multi-agent LLM systems that generates structured diagnostic reports during incident investigation.
  • GSAR improves hallucination handling by categorizing claims into grounded/ungrounded/contradicted/complementary and explicitly weighting evidence by its epistemic strength.
  • It computes an asymmetric, contradiction-penalized weighted groundedness score and maps that score to tiered decisions (proceed, regenerate, replan) within a bounded-iteration outer loop under a fixed compute budget.
  • The authors formalize the algorithm, prove six structural properties, and report consistent evaluation gains across five design claims using FEVER with gold Wikipedia evidence and four independently trained LLM judges.
  • The paper includes comparisons against a Vectara groundedness-oriented baseline and claims GSAR is the first published framework to combine evidence-typed scoring with tiered recovery under explicit compute constraints.
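The scoring-and-decision mechanism in the points above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the evidence-type weights, the partial credit given to complementary claims, the contradiction multiplier, and the decision thresholds are all assumed values chosen for the example.

```python
# Hypothetical sketch of GSAR-style scoring. All numeric constants below
# (weights, partial credit, thresholds) are illustrative assumptions.

# Evidence-type weights reflecting epistemic strength (assumed values).
EVIDENCE_WEIGHTS = {"log": 1.0, "metric": 0.9, "trace": 0.8, "inference": 0.3}

def groundedness_score(claims, rho=2.0):
    """Asymmetric, contradiction-penalized weighted score (higher is better;
    can go negative when contradictions dominate).

    `claims` is a list of (claim_type, evidence_type) pairs, where claim_type
    is one of "grounded", "ungrounded", "contradicted", "complementary".
    `rho` is the contradiction penalty multiplier; rho=0 removes the asymmetry.
    """
    total = 0.0
    weight_sum = 0.0
    for claim_type, evidence_type in claims:
        w = EVIDENCE_WEIGHTS.get(evidence_type, 0.5)
        weight_sum += w
        if claim_type == "grounded":
            total += w
        elif claim_type == "complementary":
            total += 0.5 * w   # partial credit for non-redundant perspectives
        elif claim_type == "contradicted":
            total -= rho * w   # penalized harder than mere absence of support
        # "ungrounded" adds no support but still counts in the denominator
    return total / weight_sum if weight_sum else 0.0

def decide(score, proceed_at=0.8, regenerate_at=0.5):
    """Map the score to the three-tier decision (thresholds are assumptions)."""
    if score >= proceed_at:
        return "proceed"
    if score >= regenerate_at:
        return "regenerate"
    return "replan"
```

Note how the asymmetry works: a contradicted claim backed by a strong evidence type (e.g. a log) subtracts `rho` times its weight, so with the default `rho=2.0` a single contradiction can pull an otherwise well-supported report below the replan threshold, whereas with `rho=0` it merely dilutes the score like an ungrounded claim.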

Abstract

Autonomous multi-agent LLM systems are increasingly deployed to investigate operational incidents and produce structured diagnostic reports. Their trustworthiness hinges on whether each claim is grounded in observed evidence rather than model-internal inference. Existing groundedness evaluators (binary classifiers, LLM-as-judge scalars, self-correction loops) treat supporting evidence as interchangeable and emit a single signal that offers no principled control over downstream action. We present GSAR, a grounding-evaluation and replanning framework that (i) partitions claims into a four-way typology (grounded, ungrounded, contradicted, complementary), giving first-class standing to non-redundant alternative perspectives; (ii) assigns evidence-type-specific weights reflecting epistemic strength; (iii) computes an asymmetric contradiction-penalised weighted groundedness score; and (iv) couples that score to a three-tier decision function (proceed, regenerate, replan) driving a bounded-iteration outer loop under an explicit compute budget. We formalise the algorithm, prove six structural properties, and evaluate five design claims on FEVER with gold Wikipedia evidence under four independently trained LLM judges (gpt-5.4, claude-sonnet-4-6, claude-opus-4-7, gemini-2.5-pro). Every ablation reproduces in the same direction on every judge: bootstrap 95% CIs on the ρ=0 effect exclude 0 on all four; the no-complementary ablation under Opus 4.7 has CI [−96, −68] of 200; at n=1000 three independent judges converge to ΔS(ρ=0)=+0.058. A head-to-head against Vectara HHEM-2.1-Open is included. To our knowledge, GSAR is the first published groundedness framework coupling evidence-typed scoring with tiered recovery under an explicit compute budget.
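The bounded-iteration outer loop the abstract describes can be sketched as follows. This is a minimal illustration under stated assumptions: `generate`, `evaluate`, and `replan` are placeholder callables standing in for the paper's agents and evaluator, and the iteration cap and budget values are arbitrary.

```python
# Illustrative outer loop: bounded iterations under an explicit compute
# budget, driven by a tiered decision. The function names and budget
# accounting are placeholders, not the paper's API.

def run_with_budget(generate, evaluate, replan, max_iters=3, budget=10.0):
    """Regenerate or replan until the evaluator says "proceed", the
    iteration cap is reached, or the compute budget is exhausted.

    `generate(plan)` returns (report, cost); `evaluate(report)` returns
    (score, decision) with decision in {"proceed", "regenerate", "replan"};
    `replan(plan)` returns a revised plan.
    """
    plan = "initial-plan"
    spent = 0.0
    report = None
    for _ in range(max_iters):
        report, cost = generate(plan)
        spent += cost
        _score, decision = evaluate(report)
        if decision == "proceed" or spent >= budget:
            break
        if decision == "replan":
            plan = replan(plan)  # revise the investigation plan itself
        # on "regenerate": keep the plan and retry generation
    return report, spent
```

The key design point the abstract emphasises is that recovery is tiered: a mid-range score triggers cheap regeneration with the same plan, while a low score triggers the more expensive replanning step, and both are capped by `max_iters` and `budget` so the loop terminates under a fixed compute allowance.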