Can "AI" Be a Doctor? A Study of Empathy, Readability, and Alignment in Clinical LLMs

arXiv cs.CL · April 23, 2026


Key Points

  • The study evaluates how well general-purpose and clinical LLMs align with clinical communication standards by measuring semantic fidelity, readability, and affective resonance in both structured explanations and real physician–patient dialogue.
  • Baseline models tend to be more affectively extreme than physicians and often increase linguistic complexity, with some larger models producing substantially higher Flesch-Kincaid Grade Level (FKGL) scores than physician-authored responses.
  • Empathy-oriented prompting can reduce extreme negativity and lower readability complexity, but it does not meaningfully improve semantic fidelity to physicians’ clinical content.
  • Collaborative rewriting produces the strongest overall alignment, while rephrasing achieves the highest semantic similarity and also improves readability and emotional tone.
  • Dual-stakeholder evaluation finds no model outperforms physicians on epistemic criteria, while patients consistently prefer rewritten variants for clarity and emotional tone, suggesting LLMs should support clinical communication rather than replace expertise.
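
The FKGL readability metric mentioned above is a standard formula over average sentence length and average syllables per word. A minimal sketch follows; the study's exact tokenization and syllable counting are not specified here, so the regex-based splitters and the vowel-group syllable heuristic below are illustrative assumptions, not the authors' pipeline:

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count contiguous vowel groups; every word has >= 1 syllable.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59
```

Under this formula, a model response at FKGL 17 reads at a postgraduate level, while physician answers around 11-12 sit near a high-school reading level, which is why the reported gap matters for patient-facing text.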

Abstract

Large Language Models (LLMs) are increasingly deployed in healthcare, yet their communicative alignment with clinical standards remains insufficiently quantified. We conduct a multidimensional evaluation of general-purpose and domain-specialized LLMs across structured medical explanations and real-world physician–patient interactions, analyzing semantic fidelity, readability, and affective resonance. Baseline models amplify affective polarity relative to physicians (Very Negative: 43.14-45.10% vs. 37.25%) and, in larger architectures such as GPT-5 and Claude, produce substantially higher linguistic complexity (FKGL up to 16.91-17.60 vs. 11.47-12.50 in physician-authored responses). Empathy-oriented prompting reduces extreme negativity and lowers grade-level complexity (up to -6.87 FKGL points for GPT-5) but does not significantly increase semantic fidelity. Collaborative rewriting yields the strongest overall alignment. Rephrase configurations achieve the highest semantic similarity to physician answers (up to mean = 0.93) while consistently improving readability and reducing affective extremity. Dual-stakeholder evaluation shows that no model surpasses physicians on epistemic criteria, whereas patients consistently prefer rewritten variants for clarity and emotional tone. These findings suggest that LLMs function most effectively as collaborative communication enhancers rather than replacements for clinical expertise.
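
The semantic-similarity scores reported above (up to mean = 0.93) are typically computed as cosine similarity between vector representations of the model answer and the physician answer; the abstract does not name the encoder used. As a self-contained illustration, the sketch below uses bag-of-words count vectors — a deliberately simple stand-in for the sentence embeddings such an evaluation would normally use:

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between two texts using bag-of-words count vectors.

    Illustrative proxy only: real fidelity evaluations would embed both texts
    with a sentence encoder and take the cosine of the embedding vectors.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[token] * vb[token] for token in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

The metric ranges from 0 (no shared terms) to 1 (identical term distributions), so a rephrased answer scoring 0.93 against the physician original preserves nearly all of its content while changing surface form.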