Disentangling Prompt Element Level Risk Factors for Hallucinations and Omissions in Mental Health LLM Responses

arXiv cs.CL / 4/3/2026


Key Points

  • The paper proposes UTCO (User, Topic, Context, Tone), a prompt-construction framework to systematically stress-test mental-health LLM responses using controllable inquiry elements rather than static benchmark sets.
  • In experiments with 2,075 UTCO-generated prompts, hallucinations were observed in 6.5% of responses and omissions in 13.2%, indicating omission errors are a substantial and safety-relevant failure mode.
  • Omission failures were especially concentrated in prompts involving crisis and suicidal ideation, highlighting elevated risk in high-distress scenarios.
  • Across multiple evaluation approaches (regression, element-specific matching, and similarity-matched comparisons), the most consistent predictors of failures were the prompt’s context and tone rather than user-background indicators.
  • The authors argue that evaluations should treat omissions as a primary safety outcome and broaden coverage beyond static benchmark sets to include underrepresented narrative, high-distress inquiries.
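The UTCO idea of composing prompts from four controllable elements can be sketched as a Cartesian product over element inventories. The element values and template below are hypothetical illustrations, not the paper's actual materials:

```python
from itertools import product

# Hypothetical example values for each UTCO element; the paper's
# actual element inventories are not reproduced here.
ELEMENTS = {
    "user": ["teenager", "new parent", "older adult"],
    "topic": ["insomnia", "panic attacks", "suicidal ideation"],
    "context": ["after a job loss", "during a family conflict"],
    "tone": ["matter-of-fact", "distressed first-person narrative"],
}

# Illustrative template; real prompts would be phrased as natural inquiries.
TEMPLATE = (
    "I'm a {user} dealing with {topic} {context}. "
    "Write the inquiry in a {tone} style."
)

def build_prompts(elements=ELEMENTS, template=TEMPLATE):
    """Enumerate all element combinations into labeled prompts,
    keeping the element assignment alongside each prompt so failures
    can later be attributed to specific elements."""
    keys = list(elements)
    prompts = []
    for combo in product(*(elements[k] for k in keys)):
        slots = dict(zip(keys, combo))
        prompts.append({"elements": slots, "prompt": template.format(**slots)})
    return prompts

prompts = build_prompts()
print(len(prompts))  # 3 * 3 * 2 * 2 = 36 combinations
```

Keeping the element assignment attached to each prompt is what makes the stress test "controllable": any failure observed downstream can be matched back to the exact User, Topic, Context, and Tone values that produced it.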

Abstract

Mental health concerns are often expressed outside clinical settings, including in high-distress help-seeking, where safety-critical guidance may be needed. Consumer health informatics systems increasingly incorporate large language models (LLMs) for mental health question answering, yet many evaluations underrepresent narrative, high-distress inquiries. We introduce UTCO (User, Topic, Context, Tone), a prompt-construction framework that represents an inquiry as four controllable elements for systematic stress testing. Using 2,075 UTCO-generated prompts, we evaluated Llama 3.3 and annotated hallucinations (fabricated or incorrect clinical content) and omissions (missing clinically necessary or safety-critical guidance). Hallucinations occurred in 6.5% of responses and omissions in 13.2%, with omissions concentrated in crisis and suicidal ideation prompts. Across regression, element-specific matching, and similarity-matched comparisons, failures were most consistently associated with context and tone, while user-background indicators showed no systematic differences after balancing. These findings support evaluating omissions as a primary safety outcome and moving beyond static benchmark question sets.
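The element-level attribution described in the abstract (e.g., comparing omission rates conditioned on context versus user background) can be sketched as a simple grouped failure-rate computation. The records and element values below are toy illustrations, not the paper's annotation data:

```python
from collections import defaultdict

# Hypothetical annotation records; fields mirror UTCO elements plus
# binary failure labels assigned by annotators.
records = [
    {"context": "crisis", "tone": "distressed", "omission": 1, "hallucination": 0},
    {"context": "crisis", "tone": "neutral", "omission": 1, "hallucination": 0},
    {"context": "routine", "tone": "distressed", "omission": 0, "hallucination": 1},
    {"context": "routine", "tone": "neutral", "omission": 0, "hallucination": 0},
]

def failure_rate_by(element, outcome, rows):
    """Group annotated responses by one prompt element and return
    the fraction flagged with the given failure type per value."""
    counts = defaultdict(lambda: [0, 0])  # value -> [failures, total]
    for r in rows:
        bucket = counts[r[element]]
        bucket[0] += r[outcome]
        bucket[1] += 1
    return {value: fails / total for value, (fails, total) in counts.items()}

print(failure_rate_by("context", "omission", records))
# e.g. {'crisis': 1.0, 'routine': 0.0} for the toy rows above
```

The paper goes further than raw rates, using regression and similarity-matched comparisons to control for correlated elements, but the unit of analysis is the same: failure outcomes stratified by individual prompt elements.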