Facet-Level Tracing of Evidence Uncertainty and Hallucination in RAG

arXiv cs.CL / 4/13/2026

Key Points

  • The paper argues that hallucinations in RAG persist even when relevant documents are retrieved, and that existing answer-level and passage-level evaluations miss how evidence is actually used during generation.
  • It introduces a facet-level diagnostics framework that breaks each QA question into atomic “reasoning facets” and measures evidence sufficiency/grounding via a Facet × Chunk matrix combining retrieval relevance with NLI-based faithfulness.
  • The method compares three inference modes: Strict RAG (answers restricted to retrieved evidence), Soft RAG (retrieved evidence plus parametric knowledge), and LLM-only (no retrieval). Comparing them quantifies retrieval-generation misalignment, where relevant evidence is retrieved but not correctly integrated.
  • Experiments on medical QA and HotpotQA using multiple LLMs (GPT, Gemini, LLaMA) show recurring failure modes such as evidence absence, evidence misalignment, and prior-driven overrides that are largely invisible under standard answer-level metrics.
  • The results suggest that hallucinations in RAG are driven less by retrieval accuracy and more by how retrieved evidence is integrated with the model's prior knowledge; the facet-level diagnostics make those integration failures interpretable.
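
To make the Facet × Chunk matrix from the key points more concrete, here is a minimal Python sketch. The summary does not say how the paper scores or combines the two signals, so everything here is an assumption made for illustration: the `Cell` type, the `build_matrix` and `facet_status` helpers, the 0.5 thresholds, and the status labels are all hypothetical. Relevance and entailment scores are passed in as plain numbers rather than computed by a real retriever or NLI model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    """One (facet, chunk) entry. Both fields are assumed to lie in [0, 1]."""
    relevance: float   # hypothetical retrieval relevance score
    entailment: float  # hypothetical NLI entailment probability


def build_matrix(relevance: List[List[float]],
                 entailment: List[List[float]]) -> List[List[Cell]]:
    """Pair two facet-by-chunk score grids into one grid of Cell objects."""
    if len(relevance) != len(entailment):
        raise ValueError("score grids must have the same number of facets")
    matrix = []
    for rel_row, ent_row in zip(relevance, entailment):
        if len(rel_row) != len(ent_row):
            raise ValueError("score grids must have the same number of chunks")
        matrix.append([Cell(r, e) for r, e in zip(rel_row, ent_row)])
    return matrix


def facet_status(row: List[Cell], rel_min: float = 0.5,
                 ent_min: float = 0.5) -> str:
    """Summarize one facet's row (thresholds are illustrative, not from the paper)."""
    relevant = [cell for cell in row if cell.relevance >= rel_min]
    if not relevant:
        return "no_relevant_chunk"
    if any(cell.entailment >= ent_min for cell in relevant):
        return "supported"
    return "relevant_but_not_entailing"


# Three facets scored against two retrieved chunks (made-up numbers).
matrix = build_matrix(
    relevance=[[0.9, 0.1], [0.8, 0.2], [0.1, 0.2]],
    entailment=[[0.8, 0.0], [0.2, 0.1], [0.9, 0.1]],
)
for index, row in enumerate(matrix):
    print(f"facet {index}: {facet_status(row)}")
```

Keeping relevance and entailment as separate fields, rather than multiplying them into one number, preserves the distinction the summary draws between evidence that was never retrieved and evidence that was retrieved but does not support the facet.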

Abstract

Retrieval-Augmented Generation (RAG) aims to reduce hallucination by grounding answers in retrieved evidence, yet hallucinated answers remain common even when relevant documents are available. Existing evaluations focus on answer-level or passage-level accuracy, offering limited insight into how evidence is used during generation. In this work, we introduce a facet-level diagnostics framework for QA that decomposes each input question into atomic reasoning facets. For each facet, we assess evidence sufficiency and grounding using a structured Facet × Chunk matrix that combines retrieval relevance with natural language inference-based faithfulness scores. To diagnose evidence usage, we analyze three controlled inference modes: Strict RAG, which enforces exclusive reliance on retrieved evidence; Soft RAG, which allows integration of retrieved evidence and parametric knowledge; and LLM-only generation without retrieval. Comparing these modes enables thorough analysis of retrieval-generation misalignment, defined as cases where relevant evidence is retrieved but not correctly integrated during generation. Across medical QA and HotpotQA, we evaluate three LLMs spanning open-source and closed-source models (GPT, Gemini, and LLaMA), providing interpretable diagnostics that reveal recurring facet-level failure modes, including evidence absence, evidence misalignment, and prior-driven overrides. Our results demonstrate that hallucinations in RAG systems are driven less by retrieval accuracy and more by how retrieved evidence is integrated during generation, with facet-level analysis exposing systematic evidence override and misalignment patterns that remain hidden under answer-level evaluation.
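
The abstract names three facet-level failure modes but does not give the rules used to tell them apart. The sketch below shows one plausible way such a decision could be written down, using only the Soft RAG and LLM-only outputs; Strict RAG is left out to keep it short. The `FacetObservation` fields, the `diagnose` function, and the decision order are assumptions for illustration, not the authors' procedure.

```python
from dataclasses import dataclass


@dataclass
class FacetObservation:
    """Hypothetical per-facet signals; none of these names come from the paper."""
    evidence_available: bool     # some retrieved chunk supports the facet
    soft_rag_grounded: bool      # Soft RAG output for the facet is entailed by evidence
    soft_matches_llm_only: bool  # Soft RAG output agrees with the LLM-only output


def diagnose(obs: FacetObservation) -> str:
    """Map one facet's signals to a failure-mode label (illustrative rules only)."""
    if not obs.evidence_available:
        # Nothing retrieved supports this facet.
        return "evidence_absence"
    if obs.soft_rag_grounded:
        # Evidence exists and the answer follows it.
        return "grounded"
    if obs.soft_matches_llm_only:
        # Evidence exists, but the answer mirrors what the model says without retrieval.
        return "prior_driven_override"
    # Evidence exists, the answer ignores it, and it is not the model's prior either.
    return "evidence_misalignment"


example = FacetObservation(
    evidence_available=True,
    soft_rag_grounded=False,
    soft_matches_llm_only=True,
)
print(diagnose(example))
```

A rule set like this only separates the modes at the facet level; an answer-level metric would collapse all three non-grounded labels into a single "wrong answer" outcome, which is the blind spot the abstract describes.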