Is There Knowledge Left to Extract? Evidence of Fragility in Medically Fine-Tuned Vision-Language Models
arXiv cs.CV, April 14, 2026
Key Points
- The study evaluates four open-source vision-language models in two general-vs-medical pairs (LLaVA vs. LLaVA-Med; Gemma vs. MedGemma) on progressively harder medical imaging tasks (brain tumors, pneumonia, skin cancer, histopathology) to test whether fine-tuning supports genuine clinical reasoning.
- Results show performance collapses toward near-random accuracy as difficulty increases, suggesting the models largely rely on superficial visual cues rather than robust reasoning.
- Domain-specific medical fine-tuning does not produce a consistent benefit across tasks, and the models are highly sensitive to small prompt changes, which swing both accuracy and refusal rates significantly (a measurement sketch follows this list).
- A two-stage, description-based pipeline, in which the VLM first writes an image description and a text-only model (GPT-5.1) then diagnoses from that text alone, recovers only limited additional signal and hits the same difficulty ceiling (see the pipeline sketch below).
- Embedding-level analysis indicates the failures stem from both insufficient visual representations and weak downstream reasoning (see the probe sketch below); the authors conclude that current medical VLM performance is fragile and not reliably improved by fine-tuning.
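
The prompt-sensitivity finding lends itself to a simple harness: run the same labeled images through several near-identical prompts and compare accuracy and refusal rates across phrasings. Here is a minimal Python sketch; `model_answer`, the paraphrases, and the keyword-based refusal heuristic are illustrative assumptions, not the paper's protocol.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EvalResult:
    accuracy: float
    refusal_rate: float

# Illustrative refusal heuristic: flag answers containing these markers.
REFUSAL_MARKERS = ("cannot", "unable", "i'm sorry", "as an ai")

def evaluate_prompt(model_answer: Callable[[str, object], str],
                    prompt: str, images: List, labels: List[str]) -> EvalResult:
    """Score one prompt phrasing over a labeled image set.

    model_answer(prompt, image) -> free-text answer (hypothetical signature).
    """
    correct = refused = 0
    for image, label in zip(images, labels):
        answer = model_answer(prompt, image).lower()
        if any(marker in answer for marker in REFUSAL_MARKERS):
            refused += 1
        elif label.lower() in answer:
            correct += 1
    n = len(images)
    return EvalResult(accuracy=correct / n, refusal_rate=refused / n)

# Near-identical phrasings whose results should, ideally, barely differ.
paraphrases = [
    "What is the most likely diagnosis? Answer with one word.",
    "Name the single most likely diagnosis for this image.",
    "As a clinician, which condition does this image show?",
]
# results = [evaluate_prompt(model_answer, p, images, labels) for p in paraphrases]
# accuracy_spread = max(r.accuracy for r in results) - min(r.accuracy for r in results)
```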
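
The two-stage pipeline reduces to two calls: the VLM produces a textual description, and a text-only model diagnoses from that text alone. In the sketch below, the `vlm` callable is a hypothetical wrapper around LLaVA/MedGemma inference, and the OpenAI-style call to a model named "gpt-5.1" is an assumption about how the text stage might be wired up, not the authors' code.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def describe_image(vlm, image) -> str:
    """Stage 1: the VLM writes a findings-style description of the image.
    `vlm` is a hypothetical callable wrapping LLaVA/MedGemma inference."""
    return vlm("Describe every clinically relevant finding in this image.", image)

def diagnose_from_text(description: str, options: list[str]) -> str:
    """Stage 2: a text-only model diagnoses from the description alone."""
    prompt = (
        f"Image findings: {description}\n"
        f"Pick the single most likely diagnosis from: {', '.join(options)}. "
        "Reply with the option name only."
    )
    # Model name taken from the summary; the API wiring here is an assumption.
    resp = client.chat.completions.create(
        model="gpt-5.1",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```

The split is diagnostic by design: if the text stage succeeds on good descriptions but fails on the VLM's, the bottleneck is perception. The summary above reports that, in practice, the combined pipeline still runs into the same difficulty ceiling.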
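
One standard way to run an embedding-level analysis is a linear probe: train a simple classifier on frozen image embeddings and compare it against the VLM's own answers. Below is a minimal scikit-learn sketch, assuming embeddings have already been extracted into numpy arrays; the paper's exact probing method may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_accuracy(X: np.ndarray, y: np.ndarray) -> float:
    """5-fold cross-validated accuracy of a linear probe on frozen
    image embeddings X of shape (n_samples, dim), with integer labels y."""
    clf = LogisticRegression(max_iter=2000)
    return cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()

# Reading the result: probe accuracy well above the VLM's answer accuracy
# points at weak downstream reasoning; probe accuracy near chance points
# at insufficient visual representations. The study reports both failure modes.
```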