FinReflectKG -- HalluBench: GraphRAG Hallucination Benchmark for Financial Question Answering Systems

arXiv cs.CL / 3/24/2026


Key Points

  • The paper introduces FinBench-QA-Hallucination, a benchmark to evaluate hallucination detection in KG-augmented financial QA systems using SEC 10-K filings.
  • The dataset includes 755 annotated examples with a conservative groundedness protocol that requires evidence from both textual chunks and extracted relational triplets.
  • Six hallucination detection approaches are tested, spanning LLM judges, fine-tuned classifiers, NLI models, span detectors, and embedding-based methods, under scenarios with and without KG triplets.
  • Results show strong performance for LLM judges and embedding approaches on clean KG data (F1 around 0.82–0.86), but most methods sharply degrade under noisy triplets (MCC dropping 44–84%), while embedding methods are comparatively robust (~9% degradation).
  • The study emphasizes reliability risks for compliance- and risk-focused financial deployments and proposes a framework for integrating AI reliability evaluation into high-stakes information system design.
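As a concrete illustration of the embedding-based family noted above, a groundedness check can be reduced to thresholding the cosine similarity between an answer embedding and the embeddings of its evidence (textual chunks and KG triplets). The vectors and the threshold below are illustrative placeholders, not the paper's implementation, which would use a real sentence encoder and a tuned threshold:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def is_grounded(answer_vec, evidence_vecs, threshold=0.75):
    """Flag an answer as grounded if it is close enough to at least one
    evidence embedding (textual chunk or KG triplet).  The threshold is a
    hypothetical value; in practice it would be tuned on labeled data."""
    return max(cosine(answer_vec, e) for e in evidence_vecs) >= threshold

# Toy vectors standing in for real sentence embeddings
answer = [0.9, 0.1, 0.2]
evidence = [[0.88, 0.12, 0.18], [0.1, 0.9, 0.3]]
print(is_grounded(answer, evidence))  # → True (near-match to first vector)
```

One intuition for the robustness result: because the decision depends only on the nearest evidence vector, adding noisy triplets leaves the check unchanged as long as the genuinely supporting chunk is still present, which may explain why this family degrades least under noise.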

Abstract

As organizations increasingly integrate AI-powered question-answering systems into financial information systems for compliance, risk assessment, and decision support, ensuring the factual accuracy of AI-generated outputs becomes a critical engineering challenge. Current Knowledge Graph (KG)-augmented QA systems lack systematic mechanisms to detect hallucinations: factually incorrect outputs that undermine reliability and user trust. We introduce FinBench-QA-Hallucination, a benchmark for evaluating hallucination detection methods in KG-augmented financial QA over SEC 10-K filings. The dataset contains 755 annotated examples from 300 pages, each labeled for groundedness using a conservative evidence-linkage protocol that requires support from both textual chunks and extracted relational triplets. We evaluate six detection approaches, spanning LLM judges, fine-tuned classifiers, Natural Language Inference (NLI) models, span detectors, and embedding-based methods, under two conditions: with and without KG triplets. Results show that LLM-based judges and embedding approaches achieve the highest performance (F1: 0.82-0.86) under clean conditions. However, most methods degrade significantly when noisy triplets are introduced, with the Matthews Correlation Coefficient (MCC) dropping by 44-84 percent, while embedding methods remain relatively robust, degrading by only 9 percent. Statistical tests (Cochran's Q and McNemar's) confirm significant performance differences (p < 0.001). Our findings highlight vulnerabilities in current KG-augmented systems and provide insights for building reliable financial information systems, where hallucinations can lead to regulatory violations and flawed decisions. The benchmark also offers a framework for integrating AI reliability evaluation into information system design in other high-stakes domains such as healthcare, law, and government.
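The reliability metrics in the abstract can be computed directly from confusion-matrix counts. The sketch below shows MCC and McNemar's chi-square (with continuity correction) in plain Python; the counts are made up for illustration and are not taken from the paper:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts.
    Ranges from -1 (total disagreement) to +1 (perfect prediction)."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

def mcnemar_chi2(b, c):
    """McNemar's chi-square with continuity correction, where b and c
    count the examples on which two detectors disagree (one right,
    the other wrong)."""
    return (abs(b - c) - 1) ** 2 / (b + c) if (b + c) else 0.0

# Hypothetical counts for one detector under clean vs. noisy triplets
clean = mcc(tp=310, tn=320, fp=60, fn=65)
noisy = mcc(tp=200, tn=250, fp=130, fn=175)
print(f"MCC clean={clean:.2f}, noisy={noisy:.2f}, "
      f"degradation={100 * (1 - noisy / clean):.0f}%")
```

MCC is well suited to this benchmark because it stays informative under class imbalance, which is why a 44-84 percent MCC drop signals a more severe failure than the same drop in accuracy would.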