CLEAR: Revealing How Noise and Ambiguity Degrade Reliability in LLMs for Medicine

arXiv cs.CL / 5/5/2026


Key Points

  • The CLEAR framework is proposed to measure how ambiguity and uncertainty in decision-space design affect medical LLM reliability, going beyond simplified exam-style benchmarks.
  • The evaluation systematically perturbs the number of plausible answer options, whether ground-truth/abstention options exist, and how answer options are semantically framed.
  • Applying CLEAR across three medical benchmarks and 17 LLMs shows that more plausible options reduce both correct-answer selection and safe abstention from wrong answers.
  • Reliability drops further when abstention is framed as uncertainty (“I don’t know”) rather than assertive rejection (“None of the Above”), and simply adding an “I don’t know” option can increase incorrect selections.
  • The paper introduces a “humility deficit” that quantifies the gap between choosing correct answers and abstaining from incorrect ones, and finds this gap worsens as model size increases.

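To make the perturbation scheme above concrete, here is a minimal sketch of how CLEAR-style answer-set variants could be assembled. The function name, option labels, and example drugs are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of CLEAR-style answer-set perturbation.
# Names and labels here are illustrative, not taken from the paper.
import random

NOTA = "None of the Above"   # assertive-rejection framing
IDK = "I don't know"         # uncertainty-admission framing

def build_variant(correct, distractors, n_options,
                  abstention=None, include_truth=True, seed=0):
    """Assemble one perturbed answer set.

    n_options:     number of plausible (non-abstention) options shown
    abstention:    None, NOTA, or IDK -- the framing of the opt-out
    include_truth: if False, the ground truth is withheld, so
                   abstaining is the only safe choice
    """
    rng = random.Random(seed)
    pool = list(distractors)
    rng.shuffle(pool)
    if include_truth:
        options = [correct] + pool[: n_options - 1]
    else:
        options = pool[:n_options]
    rng.shuffle(options)
    if abstention:
        options.append(abstention)  # abstention option listed last
    return options

# Cross three perturbation axes: option count x abstention framing.
variants = [
    build_variant("Aspirin", ["Ibuprofen", "Heparin", "Warfarin", "Statin"],
                  n, abstention=abst)
    for n in (2, 4)
    for abst in (None, NOTA, IDK)
]
```

Sweeping these variants over a fixed question set is what lets the evaluation separate a model's medical knowledge from its sensitivity to how the decision space is presented.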
Abstract

Medical large language model (LLM) evaluations rely on simplified, exam-style benchmarks that rarely reflect the ambiguity of real-world medical inquiries. We introduce the CLinical Evaluation of Ambiguity and Reliability (CLEAR) framework, which assesses how decision-space presentation, ambiguity, and uncertainty affect LLMs' reasoning on medical benchmarks. CLEAR systematically perturbs (1) the number of plausible answer options, (2) the presence of a ground truth or abstention option, and (3) the semantic framing of answer options. Applying CLEAR to three benchmarks across 17 LLMs reveals three notable limitations of existing evaluation methods. First, increasing the number of plausible answers degrades a model's ability to identify the correct answer and to abstain from incorrect ones. Second, this lack of caution intensifies as the framing of abstention shifts from assertive rejection like "None of the Above" to uncertainty admission like "I don't know" (IDK). Notably, merely including IDK in the answer space increases incorrect answer selections. Lastly, we formalize the performance gap between identifying the correct answer and abstaining from incorrect ones as the humility deficit, which worsens with model scale. Our findings reveal limitations in standard medical benchmarks and underscore that scaling alone does not resolve LLM reliability issues.
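The humility deficit can be sketched as the gap between two rates. The paper's exact formula is not reproduced here; the following is one plausible operationalization under the assumption that each question variant either contains the ground truth (where answering correctly is the goal) or withholds it (where abstaining is the only safe response).

```python
# Hedged sketch: one plausible way to operationalize the "humility deficit"
# described above. This is an assumption about the metric, not the paper's
# exact definition.
def humility_deficit(results):
    """results: list of (has_truth, chose_truth, abstained) triples,
    one per question variant shown to the model.

    Returns accuracy on answerable variants minus the safe-abstention
    rate on unanswerable ones. A large positive value means the model
    picks correct answers far more reliably than it declines to answer
    when no correct option exists.
    """
    answerable = [r for r in results if r[0]]
    unanswerable = [r for r in results if not r[0]]
    accuracy = sum(r[1] for r in answerable) / len(answerable)
    abstain_rate = sum(r[2] for r in unanswerable) / len(unanswerable)
    return accuracy - abstain_rate
```

Under this reading, the paper's finding that the deficit worsens with model scale would mean larger models gain accuracy faster than they gain the caution to abstain.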