Beyond Literal Summarization: Redefining Hallucination for Medical SOAP Note Evaluation

arXiv cs.AI / 4/17/2026


Key Points

  • Existing evaluation methods for medical LLMs (e.g., automated metrics or “LLM-as-judge”) often treat any information absent from the transcript as hallucination, even when it is clinically valid.
  • The study finds that many hallucination flags actually represent legitimate clinical transformations such as synonym normalization, abstraction of exam findings, diagnostic inference, and guideline-consistent care planning.
  • When evaluation is adjusted to account for clinical reasoning (via calibrated prompting and retrieval grounded in medical ontologies), results change substantially.
  • Under a lexical regime, the mean hallucination rate is reported as 35%; with inference-aware evaluation it drops to 9%, leaving a smaller set of cases tied to genuine safety issues (a toy sketch of the two regimes follows this list).
  • The paper argues that current practices may over-penalize valid reasoning and end up measuring artifacts of the evaluation design rather than true model errors, highlighting the need for clinically informed evaluation in medicine.
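
To make the distinction concrete, here is a minimal Python sketch of the two evaluation regimes. The synonym table is a toy stand-in for retrieval over a medical ontology (e.g., UMLS or SNOMED CT); all names and data are illustrative assumptions, not the paper's implementation.

```python
# Toy "ontology": maps colloquial transcript phrases to clinical terms.
# A real system would retrieve these mappings from UMLS/SNOMED CT.
SYNONYMS = {
    "high blood pressure": "hypertension",
    "stomach ache": "abdominal pain",
}

def lexical_flag(note_terms, transcript):
    """Flag any note term not literally present in the transcript."""
    text = transcript.lower()
    return [t for t in note_terms if t.lower() not in text]

def inference_aware_flag(note_terms, transcript):
    """Flag a note term only if neither it nor a mapped colloquial
    synonym appears in the transcript."""
    text = transcript.lower()
    flagged = []
    for term in note_terms:
        literal = term.lower() in text
        via_synonym = any(
            phrase in text and clinical == term.lower()
            for phrase, clinical in SYNONYMS.items()
        )
        if not (literal or via_synonym):
            flagged.append(term)
    return flagged

transcript = "Patient says she has high blood pressure and a stomach ache."
note_terms = ["hypertension", "abdominal pain", "pneumonia"]

print(lexical_flag(note_terms, transcript))
# ['hypertension', 'abdominal pain', 'pneumonia']  -- all "hallucinations"
print(inference_aware_flag(note_terms, transcript))
# ['pneumonia']  -- only the genuinely unsupported finding
```

Under the lexical check, every normalized term counts as a hallucination; the inference-aware check flags only the finding with no support in the transcript.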

Abstract

Evaluating large language models (LLMs) for clinical documentation tasks such as SOAP note generation remains challenging. Unlike standard summarization, these tasks require clinical abstraction, normalization of colloquial language, and medically grounded inference. However, prevailing evaluation methods, including automated metrics and LLM-as-judge frameworks, rely on lexical faithfulness, often labeling any information not explicitly present in the transcript as hallucination. We show that such approaches systematically misclassify clinically valid outputs as errors, inflating hallucination rates and distorting model assessment. Our analysis reveals that many flagged hallucinations correspond to legitimate clinical transformations, including synonym mapping, abstraction of examination findings, diagnostic inference, and guideline-consistent care planning. By aligning evaluation criteria with clinical reasoning (via calibrated prompting and retrieval grounded in medical ontologies), we observe a significant shift in outcomes. Under a lexical evaluation regime, the mean hallucination rate is 35%, heavily penalizing valid reasoning. With inference-aware evaluation, this drops to 9%, with remaining cases reflecting genuine safety concerns. These findings suggest that current evaluation practices over-penalize valid clinical reasoning and may measure artifacts of evaluation design rather than true errors, underscoring the need for clinically informed evaluation in high-context domains like medicine.
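
On the LLM-as-judge side, "calibrated prompting" plausibly amounts to rewriting the judge's rubric so it explicitly permits the four transformation classes the paper identifies. The sketch below shows one way such a rubric might be phrased; the wording and the `judge_model.complete` call are assumptions for illustration, not the authors' actual prompt or API.

```python
# Hypothetical judge rubrics; only the calibrated one licenses
# clinically valid transformations before flagging a statement.
NAIVE_RUBRIC = (
    "Flag as hallucination any statement in the note that does not "
    "appear verbatim in the transcript."
)

CALIBRATED_RUBRIC = (
    "Flag as hallucination only statements unsupported by the transcript "
    "AFTER allowing: (1) synonym/terminology normalization, "
    "(2) abstraction of examination findings, (3) diagnostic inference "
    "that follows from stated findings, and (4) guideline-consistent "
    "plan items for the documented diagnosis. For each flag, cite the "
    "missing support; otherwise do not flag."
)

def judge(note: str, transcript: str, rubric: str, judge_model):
    """Assemble a judge prompt and return the model's flagged statements.
    `judge_model.complete` is a placeholder for whatever LLM API is used."""
    prompt = (
        f"{rubric}\n\nTranscript:\n{transcript}\n\nSOAP note:\n{note}\n\n"
        "Return a JSON list of flagged statements with reasons."
    )
    return judge_model.complete(prompt)
```

The design point is that the hallucination rate is a property of the rubric as much as of the model: swapping NAIVE_RUBRIC for CALIBRATED_RUBRIC changes what gets counted, which is exactly the 35% vs. 9% gap the paper reports.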