Can "AI" Be a Doctor? A Study of Empathy, Readability, and Alignment in Clinical LLMs

arXiv cs.CL · April 23, 2026


Key Points

  • The study evaluates how well general-purpose and clinical LLMs align with clinical communication standards by measuring semantic fidelity, readability, and affective resonance in both structured explanations and real physician–patient dialogue.
  • Baseline models tend to be more affectively extreme than physicians and often increase linguistic complexity, with some larger models producing substantially higher Flesch-Kincaid Grade Level (FKGL) scores than physician-authored responses.
  • Empathy-oriented prompting can reduce extreme negativity and lower readability complexity, but it does not meaningfully improve semantic fidelity to physicians’ clinical content.
  • Collaborative rewriting produces the strongest overall alignment, while rephrasing achieves the highest semantic similarity and also improves readability and emotional tone.
  • Dual-stakeholder evaluation finds no model outperforms physicians on epistemic criteria, while patients consistently prefer rewritten variants for clarity and emotional tone, suggesting LLMs should support clinical communication rather than replace expertise.
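
The FKGL readability metric mentioned above is a standard formula over average sentence length and average syllables per word. A minimal sketch follows; the study's exact tokenization and syllable counting are not specified here, so the regex-based splitters and the vowel-group syllable heuristic below are illustrative assumptions, not the authors' pipeline:

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count contiguous vowel groups; every word has >= 1 syllable.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59
```

Under this formula, a model response at FKGL 17 reads at a postgraduate level, while physician answers around 11-12 sit near a high-school reading level, which is why the reported gap matters for patient-facing text.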

Abstract

Large Language Models (LLMs) are increasingly deployed in healthcare, yet their communicative alignment with clinical standards remains insufficiently quantified. We conduct a multidimensional evaluation of general-purpose and domain-specialized LLMs across structured medical explanations and real-world physician–patient interactions, analyzing semantic fidelity, readability, and affective resonance. Baseline models amplify affective polarity relative to physicians (Very Negative: 43.14-45.10% vs. 37.25%) and, in larger architectures such as GPT-5 and Claude, produce substantially higher linguistic complexity (FKGL up to 16.91-17.60 vs. 11.47-12.50 in physician-authored responses). Empathy-oriented prompting reduces extreme negativity and lowers grade-level complexity (up to -6.87 FKGL points for GPT-5) but does not significantly increase semantic fidelity. Collaborative rewriting yields the strongest overall alignment. Rephrase configurations achieve the highest semantic similarity to physician answers (up to mean = 0.93) while consistently improving readability and reducing affective extremity. Dual-stakeholder evaluation shows that no model surpasses physicians on epistemic criteria, whereas patients consistently prefer rewritten variants for clarity and emotional tone. These findings suggest that LLMs function most effectively as collaborative communication enhancers rather than replacements for clinical expertise.
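
The semantic-similarity scores reported above (up to mean = 0.93) are typically computed as cosine similarity between vector representations of the model answer and the physician answer; the abstract does not name the encoder used. As a self-contained illustration, the sketch below uses bag-of-words count vectors — a deliberately simple stand-in for the sentence embeddings such an evaluation would normally use:

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between two texts using bag-of-words count vectors.

    Illustrative proxy only: real fidelity evaluations would embed both texts
    with a sentence encoder and take the cosine of the embedding vectors.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[token] * vb[token] for token in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

The metric ranges from 0 (no shared terms) to 1 (identical term distributions), so a rephrased answer scoring 0.93 against the physician original preserves nearly all of its content while changing surface form.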