Is There Knowledge Left to Extract? Evidence of Fragility in Medically Fine-Tuned Vision-Language Models

arXiv cs.CV / April 14, 2026


Key Points

  • The study evaluates four paired open-source vision-language models (LLaVA vs. LLaVA-Med; Gemma vs. MedGemma) on progressively harder medical imaging tasks (brain tumors, pneumonia, skin cancer, histopathology) to test whether fine-tuning supports genuine clinical reasoning.
  • Results show performance collapses toward near-random accuracy as difficulty increases, suggesting the models largely rely on superficial visual cues rather than robust reasoning.
  • Domain-specific medical fine-tuning does not produce a consistent benefit across tasks, and the models are highly sensitive to small prompt changes that significantly swing both accuracy and refusal rates.
  • A description-based pipeline, in which the VLM generates a free-text image description and a text-only model (GPT-5.1) diagnoses from that description, recovers only limited additional signal and still hits the same difficulty ceiling.
  • Embedding-level analysis indicates failures arise from both insufficient visual representations and weak downstream reasoning, concluding that current medical VLM performance is fragile and not reliably improved by fine-tuning.
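The description-based pipeline in the points above can be sketched as a simple two-stage function. The model calls here are hypothetical stubs (the paper does not specify an API): in a real run, `describe_image` would query the VLM and `diagnose_from_text` would query the text-only model.

```python
# Sketch of the two-stage description-based pipeline.
# Both functions are hypothetical stand-ins for actual model calls:
# stage 1 would query a VLM (e.g. LLaVA), stage 2 a text-only LLM.

def describe_image(image_path: str) -> str:
    """Stub for stage 1: the VLM turns the image into a text description."""
    return f"Chest X-ray ({image_path}): patchy opacity in the right lower lobe."

def diagnose_from_text(description: str, labels: list[str]) -> str:
    """Stub for stage 2: a text-only model picks a label from the description."""
    # Trivial keyword heuristic standing in for an LLM call.
    return labels[0] if "opacity" in description.lower() else labels[1]

def pipeline(image_path: str, labels: list[str]) -> str:
    description = describe_image(image_path)        # stage 1: image -> text
    return diagnose_from_text(description, labels)  # stage 2: text -> label

print(pipeline("patient_042.png", ["pneumonia", "normal"]))  # -> pneumonia
```

Decoupling the stages this way is what lets the authors test whether closed-form VQA is suppressing knowledge the VLM can express in free text.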

Abstract

Vision-language models (VLMs) are increasingly adapted through domain-specific fine-tuning, yet it remains unclear whether this improves reasoning beyond superficial visual cues, particularly in high-stakes domains like medicine. We evaluate four paired open-source VLMs (LLaVA vs. LLaVA-Med; Gemma vs. MedGemma) across four medical imaging tasks of increasing difficulty: brain tumor, pneumonia, skin cancer, and histopathology classification. We find that performance degrades toward near-random levels as task difficulty increases, indicating limited clinical reasoning. Medical fine-tuning provides no consistent advantage, and models are highly sensitive to prompt formulation, with minor changes causing large swings in accuracy and refusal rates. To test whether closed-form VQA suppresses latent knowledge, we introduce a description-based pipeline where models generate image descriptions that a text-only model (GPT-5.1) uses for diagnosis. This recovers a limited additional signal but remains bounded by task difficulty. Analysis of vision encoder embeddings further shows that failures stem from both weak visual representations and downstream reasoning. Overall, medical VLM performance is fragile, prompt-dependent, and not reliably improved by domain-specific fine-tuning.
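The embedding-level analysis mentioned in the abstract can be illustrated with a linear probe: train a linear classifier on frozen vision-encoder embeddings, and if even that probe sits near chance, the visual representations themselves lack class-separating signal. This sketch is an assumption about the general technique, not the authors' exact analysis, and uses synthetic vectors in place of real encoder embeddings.

```python
# Linear-probe sketch on (synthetic) frozen vision-encoder embeddings.
# A near-chance probe accuracy would indicate the failure lies partly in
# the visual representations, not only in downstream reasoning.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, dim = 400, 64

# Simulated "weak representation": the two class means barely differ.
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, dim)) + 0.05 * y[:, None]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = probe.score(X_te, y_te)
print(f"linear-probe accuracy: {acc:.2f}")  # near 0.5 here -> weak embeddings
```

Running the same probe on strongly separated embeddings would push accuracy well above chance, which is how this analysis distinguishes representation failures from reasoning failures.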