FinReflectKG -- HalluBench: GraphRAG Hallucination Benchmark for Financial Question Answering Systems
arXiv cs.CL / 3/24/2026
Key Points
- The paper introduces FinReflectKG-HalluBench, a benchmark for evaluating hallucination detection in KG-augmented financial QA systems built on SEC 10-K filings.
- The dataset contains 755 annotated examples under a conservative groundedness protocol that requires supporting evidence from both textual chunks and extracted relational triplets.
- Six hallucination detection approaches are tested (LLM judges, fine-tuned classifiers, NLI models, span detectors, and embedding-based methods) under scenarios with and without KG triplets.
- Results show strong performance for LLM judges and embedding-based approaches on clean KG data (F1 ≈ 0.82–0.86), but most methods degrade sharply under noisy triplets (MCC drops of 44–84%), while embedding-based methods remain comparatively robust (~9% degradation).
- The study emphasizes reliability risks for compliance- and risk-focused financial deployments and proposes a framework for integrating AI reliability evaluation into high-stakes information system design.