Predictive Entropy Links Calibration and Paraphrase Sensitivity in Medical Vision-Language Models

arXiv cs.LG / 4/13/2026


Key Points

  • The paper argues that two key safety risks in medical vision-language models—miscalibrated confidence and sensitivity to question rephrasing—share a common mechanism tied to proximity to the decision boundary.
  • By benchmarking five uncertainty quantification methods on MedGemma 4B-IT across in-distribution MIMIC-CXR and out-of-distribution PadChest (with cross-architecture validation on LLaVA-Rad 7B), the authors show that predictive entropy from a single forward pass can predict both error likelihood and which samples will flip under paraphrasing.
  • Results indicate that a single predictive-entropy threshold can flag both unreliable and paraphrase-sensitive predictions, achieving AUROC of 0.711 on MedGemma and 0.878 on LLaVA-Rad 7B.
  • The study finds that a five-member LoRA ensemble degrades in both calibration and accuracy under dataset shift (MIMIC→PadChest), while the LLaVA-Rad ensemble is more robust.
  • Among single-model methods, MC Dropout shows the best calibration (lowest reported ECE) and selective-prediction coverage, but predictive entropy still outperforms the ensemble on both error-detection AUROC and paraphrase screening.
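The screening rule underlying these findings — compute predictive entropy from one forward pass over the answer distribution, then apply a single threshold — can be sketched as follows. This is an illustrative reconstruction, not the paper's code; the function names and the 0.5-nat threshold are assumptions:

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of a model's predictive distribution
    over answer options; high entropy means the sample sits near the
    decision boundary."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def flag_prediction(probs, threshold=0.5):
    """Flag a sample as unreliable and likely paraphrase-sensitive when
    its predictive entropy exceeds a tuned threshold (0.5 nats here is
    purely illustrative)."""
    return predictive_entropy(probs) > threshold

# A confident prediction has near-zero entropy and is not flagged;
# a 50/50 prediction has entropy ln(2) ≈ 0.693 and is flagged.
```

The appeal the paper highlights is that this needs no ensemble, no repeated sampling, and no paraphrase set at inference time: one forward pass yields a score that screens for both failure modes at once.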

Abstract

Medical vision-language models (VLMs) suffer from two failure modes that threaten safe deployment: miscalibrated confidence and sensitivity to question rephrasing. We show they share a common cause, proximity to the decision boundary, by benchmarking five uncertainty quantification methods on MedGemma 4B-IT across in-distribution MIMIC-CXR and out-of-distribution PadChest chest X-ray datasets, with cross-architecture validation on LLaVA-Rad 7B. For well-calibrated single-model methods, predictive entropy from one forward pass predicts which samples will flip under rephrasing (AUROC 0.711 on MedGemma, 0.878 on LLaVA-Rad; p < 10⁻⁴), enabling a single entropy threshold to flag both unreliable and rephrase-sensitive predictions. A five-member LoRA ensemble fails under the MIMIC→PadChest shift (42.9% ECE, 34.1% accuracy), though LLaVA-Rad's ensemble does not collapse (69.1%). MC Dropout achieves the best calibration (4.3% ECE) and selective-prediction coverage (21.5% at 5% risk), yet total entropy from a single forward pass outperforms the ensemble for both error detection (AUROC 0.743 vs. 0.657) and paraphrase screening. Simple methods win.
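For reference, the expected calibration error (ECE) quoted in the abstract is conventionally estimated by partitioning predictions into equal-width confidence bins and averaging the gap between mean confidence and accuracy per bin, weighted by bin size. A minimal sketch of that standard estimator (bin count and names are assumptions, not the paper's implementation):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: partition predictions into equal-width confidence bins,
    then average |mean confidence - accuracy| per bin, weighted by
    the fraction of samples in the bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Map confidence in [0, 1] to a bin index; clamp 1.0 into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(h for _, h in members) / len(members)
            ece += len(members) / n * abs(avg_conf - accuracy)
    return ece

# E.g., predictions at 95% confidence that are all correct give a 5% gap
# in their bin, so ECE = 0.05; a well-calibrated model drives this toward 0.
```

Under this metric, the reported 42.9% ECE for the LoRA ensemble under shift means its stated confidence diverged from its actual accuracy by more than 40 points on average, which is why the paper treats it as a deployment risk.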