Consistent but Dangerous: Per-Sample Safety Classification Reveals False Reliability in Medical Vision-Language Models

arXiv cs.CV · March 24, 2026


Key Points

  • The paper argues that using paraphrase consistency as a proxy for reliability in medical vision-language models is fundamentally flawed because models can remain perfectly consistent while ignoring the input image and relying on text patterns.
  • It introduces a four-quadrant per-sample safety taxonomy (Ideal, Fragile, Dangerous, Worst) that evaluates both consistency across paraphrased prompts and whether predictions depend on the image.
  • Experiments on five medical VLM configurations over two chest X-ray datasets (MIMIC-CXR and PadChest) show that LoRA fine-tuning can sharply reduce prediction flip rates while moving most samples into the “Dangerous” category, indicating false reliability.
  • “Dangerous” samples can still be highly accurate (up to 99.6%) with low entropy, meaning confidence-based screening may miss the image-ignoring failure mode.
  • The authors recommend that deployment evaluations combine consistency checks with a text-only baseline (e.g., one additional forward pass per sample with the image removed) to detect this failure mode efficiently.
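The per-sample taxonomy above can be sketched as a simple two-test classifier. This is a minimal illustration, not the paper's exact protocol: `predict` is a hypothetical stand-in for a VLM inference call, and the paraphrase set and prediction types are assumptions.

```python
def classify_sample(predict, image, paraphrases):
    """Assign one sample to a quadrant: Ideal / Fragile / Dangerous / Worst.

    predict(image, prompt) -> prediction label; passing image=None
    simulates the text-only condition (hypothetical interface).
    """
    # Consistency: do all paraphrased prompts yield the same prediction?
    with_image = [predict(image, p) for p in paraphrases]
    consistent = len(set(with_image)) == 1

    # Image reliance: does removing the image change any prediction?
    text_only = [predict(None, p) for p in paraphrases]
    image_reliant = any(a != b for a, b in zip(with_image, text_only))

    if consistent and image_reliant:
        return "Ideal"
    if not consistent and image_reliant:
        return "Fragile"
    if consistent and not image_reliant:
        return "Dangerous"  # stable, but the image is being ignored
    return "Worst"
```

A model that always answers "pneumonia" regardless of input would be perfectly consistent yet land in the Dangerous quadrant, which is exactly the trap the paper describes.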

Abstract

Consistency under paraphrase, the property that semantically equivalent prompts yield identical predictions, is increasingly used as a proxy for reliability when deploying medical vision-language models (VLMs). We show this proxy is fundamentally flawed: a model can achieve perfect consistency by relying on text patterns rather than the input image. We introduce a four-quadrant per-sample safety taxonomy that jointly evaluates consistency (stable predictions across paraphrased prompts) and image reliance (predictions that change when the image is removed). Samples are classified as Ideal (consistent and image-reliant), Fragile (inconsistent but image-reliant), Dangerous (consistent but not image-reliant), or Worst (inconsistent and not image-reliant). Evaluating five medical VLM configurations across two chest X-ray datasets (MIMIC-CXR, PadChest), we find that LoRA fine-tuning dramatically reduces flip rates but shifts a majority of samples into the Dangerous quadrant: LLaVA-Rad Base achieves a 1.5% flip rate on PadChest while 98.5% of its samples are Dangerous. Critically, Dangerous samples exhibit high accuracy (up to 99.6%) and low entropy, making them invisible to standard confidence-based screening. We observe a negative correlation between flip rate and Dangerous fraction (r = -0.89, n=10) and recommend that deployment evaluations always pair consistency checks with a text-only baseline: a single additional forward pass that exposes the false reliability trap.