Blending Human and LLM Expertise to Detect Hallucinations and Omissions in Mental Health Chatbot Responses

arXiv cs.CL / 4/9/2026


Key Points

  • The paper finds that common LLM-as-a-judge approaches perform poorly on mental health counseling data, reaching only about 52% accuracy and sometimes near-zero recall for hallucination detection.
  • It attributes the weakness to LLM judges’ inability to capture the nuanced linguistic and therapeutic patterns that human domain experts rely on for safety-critical evaluation.
  • The authors propose a human+LLM framework that extracts interpretable, domain-informed features across five dimensions: logical consistency, entity verification, factual accuracy, linguistic uncertainty, and professional appropriateness.
  • Experiments on both a public mental health dataset and a new human-annotated dataset show that traditional ML models trained on these features achieve stronger hallucination detection (0.717 F1 on the custom set; 0.849 F1 on the public benchmark) but more modest omission detection (0.59–0.64 F1 across both datasets).
  • Overall, the work argues that combining domain expertise with structured automated evaluation is more reliable and transparent than relying on black-box LLM judging for high-stakes mental health chatbot use.

Abstract

As LLM-powered chatbots are increasingly deployed in mental health services, detecting hallucinations and omissions has become critical for user safety. However, state-of-the-art LLM-as-a-judge methods often fail in high-risk healthcare contexts, where subtle errors can have serious consequences. We show that leading LLM judges achieve only 52% accuracy on mental health counseling data, with some hallucination detection approaches exhibiting near-zero recall. We identify the root cause as LLMs' inability to capture nuanced linguistic and therapeutic patterns recognized by domain experts. To address this, we propose a framework that integrates human expertise with LLMs to extract interpretable, domain-informed features across five analytical dimensions: logical consistency, entity verification, factual accuracy, linguistic uncertainty, and professional appropriateness. Experiments on a public mental health dataset and a new human-annotated dataset show that traditional machine learning models trained on these features achieve 0.717 F1 on our custom dataset and 0.849 F1 on a public benchmark for hallucination detection, with 0.59-0.64 F1 for omission detection across both datasets. Our results demonstrate that combining domain expertise with automated methods yields more reliable and transparent evaluation than black-box LLM judging in high-stakes mental health applications.
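The pipeline the abstract describes — interpretable, domain-informed features feeding a traditional classifier — can be sketched in miniature. Everything below is hypothetical: the scoring heuristics, weights, and threshold are illustrative stand-ins, not the authors' feature extractors (which combine human expertise with LLMs) or their trained models.

```python
# Sketch of feature-based hallucination scoring, assuming toy heuristics.
# The paper's actual features are human+LLM-derived; these are placeholders.

DIMENSIONS = [
    "logical_consistency",
    "entity_verification",
    "factual_accuracy",
    "linguistic_uncertainty",
    "professional_appropriateness",
]

def extract_features(response: str, context: str) -> dict:
    """Score a chatbot response on the five dimensions (all values in [0, 1])."""
    resp, ctx = response.lower(), context.lower()
    words = resp.split()
    hedges = ("might", "may", "perhaps", "possibly", "unsure")
    # Entities approximated as capitalized tokens; verified if seen in context.
    entities = [t for t in response.split() if t[:1].isupper()]
    verified = sum(1 for e in entities if e.lower() in ctx)
    return {
        # Crude self-contradiction check (placeholder heuristic).
        "logical_consistency": 0.0 if " not " in resp and " always " in resp else 1.0,
        "entity_verification": verified / len(entities) if entities else 1.0,
        # Any lexical overlap with the context counts as "grounded" here.
        "factual_accuracy": 1.0 if any(w in ctx for w in words) else 0.0,
        # Density of hedging terms, capped at 1.0.
        "linguistic_uncertainty": min(
            sum(resp.count(h) for h in hedges) / max(len(words), 1), 1.0
        ),
        # Overconfident clinical claims flagged as unprofessional.
        "professional_appropriateness": 0.0 if "guarantee" in resp else 1.0,
    }

def hallucination_risk(f: dict) -> float:
    """Linear risk score with hand-set weights; the paper learns these via ML."""
    return (
        0.25 * (1 - f["logical_consistency"])
        + 0.25 * (1 - f["entity_verification"])
        + 0.25 * (1 - f["factual_accuracy"])
        + 0.10 * f["linguistic_uncertainty"]
        + 0.15 * (1 - f["professional_appropriateness"])
    )

def flag_hallucination(f: dict, threshold: float = 0.3) -> bool:
    return hallucination_risk(f) > threshold
```

In the paper, the feature vector goes to a trained traditional ML model rather than fixed weights; the point of the design is that each dimension remains inspectable, so a flagged response can be traced to, say, an unverified entity rather than an opaque judge verdict.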
