Predictive Entropy Links Calibration and Paraphrase Sensitivity in Medical Vision-Language Models
arXiv cs.LG · April 13, 2026
Key Points
- The paper argues that two key safety risks in medical vision-language models—miscalibrated confidence and sensitivity to question rephrasing—share a common mechanism tied to proximity to the decision boundary.
- By benchmarking five uncertainty quantification methods on MedGemma 4B-IT across in-distribution MIMIC-CXR and out-of-distribution PadChest (with cross-architecture validation on LLaVA-Rad 7B), the authors show that predictive entropy from a single forward pass can predict both error likelihood and which samples will flip under paraphrase changes.
- Results indicate that a single predictive-entropy threshold can flag both unreliable predictions and paraphrase-sensitive predictions, achieving AUROC of roughly 0.711 on MedGemma and 0.878 on LLaVA-Rad 7B.
- The study finds that a five-member LoRA ensemble degrades calibration and accuracy under dataset shift (MIMIC-CXR→PadChest), while the LLaVA-Rad ensemble is more robust.
- Among single-model methods, MC Dropout shows the best calibration (lowest reported ECE) and selective-prediction coverage, but predictive entropy still outperforms the ensemble on both error-detection AUROC and paraphrase screening.
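The core quantity the paper relies on is cheap to compute: predictive entropy needs only the class probabilities from a single forward pass. The sketch below illustrates the idea with toy probabilities; the function name, the threshold value, and the two-class example are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def predictive_entropy(probs):
    """Shannon entropy of a categorical predictive distribution.

    probs: array of shape (n_samples, n_classes), rows summing to 1.
    Unlike ensembles or MC Dropout, this needs one forward pass per sample.
    """
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -np.sum(p * np.log(p), axis=-1)

# Toy example: a confident prediction vs. one near the decision boundary.
probs = np.array([
    [0.95, 0.05],  # confident -> low entropy
    [0.55, 0.45],  # near the boundary -> high entropy
])
H = predictive_entropy(probs)

# A single threshold (value here is illustrative, not from the paper)
# flags samples both as likely errors and as paraphrase-sensitive.
tau = 0.5
flagged = H > tau  # -> [False, True]
```

The near-uniform row is exactly the "proximity to the decision boundary" case the paper ties to both failure modes: small input perturbations, including paraphrases of the question, can flip its argmax.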