FinReflectKG -- HalluBench: GraphRAG Hallucination Benchmark for Financial Question Answering Systems

arXiv cs.CL / 3/24/2026


Key Points

  • The paper introduces FinBench-QA-Hallucination, a benchmark to evaluate hallucination detection in KG-augmented financial QA systems using SEC 10-K filings.
  • The dataset includes 755 annotated examples with a conservative groundedness protocol that requires evidence from both textual chunks and extracted relational triplets.
  • Six hallucination detection approaches are tested, spanning LLM judges, fine-tuned classifiers, NLI models, span detectors, and embedding-based methods, under scenarios with and without KG triplets.
  • Results show strong performance for LLM judges and embedding approaches on clean KG data (F1 around 0.82–0.86), but most methods sharply degrade under noisy triplets (MCC dropping 44–84%), while embedding methods are comparatively robust (~9% degradation).
  • The study emphasizes reliability risks for compliance- and risk-focused financial deployments and proposes a framework for integrating AI reliability evaluation into high-stakes information system design.
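As a concrete illustration of the embedding-based family noted above, a groundedness check can be reduced to thresholding the cosine similarity between an answer embedding and the embeddings of its evidence (textual chunks and KG triplets). The vectors and the threshold below are illustrative placeholders, not the paper's implementation, which would use a real sentence encoder and a tuned threshold:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def is_grounded(answer_vec, evidence_vecs, threshold=0.75):
    """Flag an answer as grounded if it is close enough to at least one
    evidence embedding (textual chunk or KG triplet).  The threshold is a
    hypothetical value; in practice it would be tuned on labeled data."""
    return max(cosine(answer_vec, e) for e in evidence_vecs) >= threshold

# Toy vectors standing in for real sentence embeddings
answer = [0.9, 0.1, 0.2]
evidence = [[0.88, 0.12, 0.18], [0.1, 0.9, 0.3]]
print(is_grounded(answer, evidence))  # → True (near-match to first vector)
```

One intuition for the robustness result: because the decision depends only on the nearest evidence vector, adding noisy triplets leaves the check unchanged as long as the genuinely supporting chunk is still present, which may explain why this family degrades least under noise.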

Abstract

As organizations increasingly integrate AI-powered question-answering systems into financial information systems for compliance, risk assessment, and decision support, ensuring the factual accuracy of AI-generated outputs becomes a critical engineering challenge. Current Knowledge Graph (KG)-augmented QA systems lack systematic mechanisms to detect hallucinations: factually incorrect outputs that undermine reliability and user trust. We introduce FinBench-QA-Hallucination, a benchmark for evaluating hallucination detection methods in KG-augmented financial QA over SEC 10-K filings. The dataset contains 755 annotated examples from 300 pages, each labeled for groundedness using a conservative evidence-linkage protocol that requires support from both textual chunks and extracted relational triplets. We evaluate six detection approaches, spanning LLM judges, fine-tuned classifiers, Natural Language Inference (NLI) models, span detectors, and embedding-based methods, under two conditions: with and without KG triplets. Results show that LLM-based judges and embedding approaches achieve the highest performance (F1: 0.82-0.86) under clean conditions. However, most methods degrade significantly when noisy triplets are introduced, with the Matthews Correlation Coefficient (MCC) dropping by 44-84 percent, while embedding methods remain relatively robust, degrading by only 9 percent. Statistical tests (Cochran's Q and McNemar's) confirm significant performance differences (p < 0.001). Our findings highlight vulnerabilities in current KG-augmented systems and provide insights for building reliable financial information systems, where hallucinations can lead to regulatory violations and flawed decisions. The benchmark also offers a framework for integrating AI reliability evaluation into information system design in other high-stakes domains such as healthcare, law, and government.
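The reliability metrics in the abstract can be computed directly from confusion-matrix counts. The sketch below shows MCC and McNemar's chi-square (with continuity correction) in plain Python; the counts are made up for illustration and are not taken from the paper:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts.
    Ranges from -1 (total disagreement) to +1 (perfect prediction)."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

def mcnemar_chi2(b, c):
    """McNemar's chi-square with continuity correction, where b and c
    count the examples on which two detectors disagree (one right,
    the other wrong)."""
    return (abs(b - c) - 1) ** 2 / (b + c) if (b + c) else 0.0

# Hypothetical counts for one detector under clean vs. noisy triplets
clean = mcc(tp=310, tn=320, fp=60, fn=65)
noisy = mcc(tp=200, tn=250, fp=130, fn=175)
print(f"MCC clean={clean:.2f}, noisy={noisy:.2f}, "
      f"degradation={100 * (1 - noisy / clean):.0f}%")
```

MCC is well suited to this benchmark because it stays informative under class imbalance, which is why a 44-84 percent MCC drop signals a more severe failure than the same drop in accuracy would.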