Blending Human and LLM Expertise to Detect Hallucinations and Omissions in Mental Health Chatbot Responses

arXiv cs.CL / 4/9/2026


Key Points

  • The paper finds that common LLM-as-a-judge approaches perform poorly on mental health counseling data, reaching only about 52% accuracy and sometimes near-zero recall for hallucination detection.
  • It attributes the weakness to LLM judges’ inability to capture the nuanced linguistic and therapeutic patterns that human domain experts rely on for safety-critical evaluation.
  • The authors propose a human+LLM framework that extracts interpretable, domain-informed features across five dimensions: logical consistency, entity verification, factual accuracy, linguistic uncertainty, and professional appropriateness.
  • Experiments on both a public mental health dataset and a new human-annotated dataset show that traditional ML models trained on these features achieve stronger hallucination detection (0.717 F1 on the custom set; 0.849 F1 on the public benchmark) but more modest omission detection (0.59–0.64 F1 across both datasets).
  • Overall, the work argues that combining domain expertise with structured automated evaluation is more reliable and transparent than relying on black-box LLM judging for high-stakes mental health chatbot use.

Abstract

As LLM-powered chatbots are increasingly deployed in mental health services, detecting hallucinations and omissions has become critical for user safety. However, state-of-the-art LLM-as-a-judge methods often fail in high-risk healthcare contexts, where subtle errors can have serious consequences. We show that leading LLM judges achieve only 52% accuracy on mental health counseling data, with some hallucination detection approaches exhibiting near-zero recall. We identify the root cause as LLMs' inability to capture nuanced linguistic and therapeutic patterns recognized by domain experts. To address this, we propose a framework that integrates human expertise with LLMs to extract interpretable, domain-informed features across five analytical dimensions: logical consistency, entity verification, factual accuracy, linguistic uncertainty, and professional appropriateness. Experiments on a public mental health dataset and a new human-annotated dataset show that traditional machine learning models trained on these features achieve 0.717 F1 on our custom dataset and 0.849 F1 on a public benchmark for hallucination detection, with 0.59-0.64 F1 for omission detection across both datasets. Our results demonstrate that combining domain expertise with automated methods yields more reliable and transparent evaluation than black-box LLM judging in high-stakes mental health applications.
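The pipeline the abstract describes — interpretable, domain-informed features feeding a traditional classifier — can be sketched in miniature. Everything below is hypothetical: the scoring heuristics, weights, and threshold are illustrative stand-ins, not the authors' feature extractors (which combine human expertise with LLMs) or their trained models.

```python
# Sketch of feature-based hallucination scoring, assuming toy heuristics.
# The paper's actual features are human+LLM-derived; these are placeholders.

DIMENSIONS = [
    "logical_consistency",
    "entity_verification",
    "factual_accuracy",
    "linguistic_uncertainty",
    "professional_appropriateness",
]

def extract_features(response: str, context: str) -> dict:
    """Score a chatbot response on the five dimensions (all values in [0, 1])."""
    resp, ctx = response.lower(), context.lower()
    words = resp.split()
    hedges = ("might", "may", "perhaps", "possibly", "unsure")
    # Entities approximated as capitalized tokens; verified if seen in context.
    entities = [t for t in response.split() if t[:1].isupper()]
    verified = sum(1 for e in entities if e.lower() in ctx)
    return {
        # Crude self-contradiction check (placeholder heuristic).
        "logical_consistency": 0.0 if " not " in resp and " always " in resp else 1.0,
        "entity_verification": verified / len(entities) if entities else 1.0,
        # Any lexical overlap with the context counts as "grounded" here.
        "factual_accuracy": 1.0 if any(w in ctx for w in words) else 0.0,
        # Density of hedging terms, capped at 1.0.
        "linguistic_uncertainty": min(
            sum(resp.count(h) for h in hedges) / max(len(words), 1), 1.0
        ),
        # Overconfident clinical claims flagged as unprofessional.
        "professional_appropriateness": 0.0 if "guarantee" in resp else 1.0,
    }

def hallucination_risk(f: dict) -> float:
    """Linear risk score with hand-set weights; the paper learns these via ML."""
    return (
        0.25 * (1 - f["logical_consistency"])
        + 0.25 * (1 - f["entity_verification"])
        + 0.25 * (1 - f["factual_accuracy"])
        + 0.10 * f["linguistic_uncertainty"]
        + 0.15 * (1 - f["professional_appropriateness"])
    )

def flag_hallucination(f: dict, threshold: float = 0.3) -> bool:
    return hallucination_risk(f) > threshold
```

In the paper, the feature vector goes to a trained traditional ML model rather than fixed weights; the point of the design is that each dimension remains inspectable, so a flagged response can be traced to, say, an unverified entity rather than an opaque judge verdict.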
