Auditing Frontier Vision-Language Models for Trustworthy Medical VQA: Grounding Failures, Format Collapse, and Domain Adaptation
arXiv cs.AI / 5/1/2026
Key Points
- The paper audits five recent frontier/grounding-aware vision-language models on Medical VQA, finding uniformly weak localization of anatomical and pathological targets (best mean IoU: only 0.23; the metric is sketched after this list) along with clinically risky laterality confusion.
- In a two-step self-grounding pipeline (the same model first localizes the relevant region, then answers), VQA accuracy drops for every model, driven both by inaccurate localization and by severe format-compliance and parsing failures; a pipeline sketch follows this list.
- When the pipeline substitutes ground-truth annotations for the predicted bounding boxes, VQA accuracy recovers and improves, indicating that the core failure lies in the perception/localization step rather than in the localize-then-answer decomposition itself.
- As a domain-adaptation follow-up, supervised fine-tuning of Qwen 2.5 VL on combined Med-VQA data yields the best reported SLAKE open-ended recall (85.5%; the usual recall convention is sketched below) among comparable methods, though whether this fully resolves the trustworthiness bottleneck remains open.
- Overall, the study identifies grounding quality (bounding-box localization reliability) as a primary bottleneck for trustworthy clinical deployment of VLMs under realistic failure conditions.
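The headline localization number is easiest to read with the metric in hand. Below is a minimal sketch of per-box intersection-over-union and its mean over an evaluation set, assuming axis-aligned (x1, y1, x2, y2) boxes; the function names are illustrative, not taken from the paper.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_iou(pred_boxes, gt_boxes):
    """Average IoU over paired predicted and ground-truth boxes."""
    return sum(box_iou(p, g) for p, g in zip(pred_boxes, gt_boxes)) / len(gt_boxes)
```

A mean IoU of 0.23 means the intersection typically covers less than a quarter of the union of the two boxes, i.e. predicted and target regions barely overlap.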
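The two-step pipeline and the ground-truth ablation from the second and third points can be combined into one driver. This is a minimal sketch, not the paper's code: the `model.generate(image, prompt)` interface, the JSON box prompt, and cropping the region (rather than passing coordinates back as text) are all assumptions made for illustration. The parser's `None` path corresponds to the format-compliance failures the audit counts.

```python
import json
import re

def parse_box(raw: str):
    """Best-effort (x1, y1, x2, y2) extraction from a model reply.

    Returns None when nothing parseable is found -- the format-compliance
    failure mode that drags accuracy down in the self-grounding setting."""
    try:
        box = json.loads(raw).get("box")  # well-formed case: {"box": [...]}
        if isinstance(box, list) and len(box) == 4:
            return tuple(float(v) for v in box)
    except (json.JSONDecodeError, AttributeError):
        pass
    nums = re.findall(r"-?\d+(?:\.\d+)?", raw)  # fallback: first four numbers
    return tuple(float(v) for v in nums[:4]) if len(nums) >= 4 else None

def self_grounded_vqa(model, image, question, oracle_box=None):
    """Localize-then-answer with a single model.

    Pass oracle_box to reproduce the ground-truth-grounding ablation,
    skipping the model's own localization entirely."""
    box = oracle_box
    if box is None:
        raw = model.generate(
            image,
            'Return JSON {"box": [x1, y1, x2, y2]} for the region relevant to: '
            + question,
        )
        box = parse_box(raw)
        if box is None:
            return None  # scored as a parsing failure, not a wrong answer
    region = image.crop(box)  # PIL-style crop of the grounded region
    return model.generate(region, question)
```

With `oracle_box` supplied, only the answering step runs, and it runs on a correct region; the accuracy recovery in that setting is what isolates localization as the bottleneck.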
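The SLAKE open-ended score quoted in the fourth point is a recall. The exact convention is not given in the summary; a common one for open-ended Med-VQA, sketched here as an assumption, is token-level recall of the ground-truth answer:

```python
def open_ended_recall(pred: str, gold: str) -> float:
    """Fraction of ground-truth answer tokens that appear in the prediction."""
    gold_tokens = gold.lower().split()
    if not gold_tokens:
        return 0.0
    pred_tokens = set(pred.lower().split())
    return sum(t in pred_tokens for t in gold_tokens) / len(gold_tokens)

# e.g. open_ended_recall("the left lung", "left lung") == 1.0
```

Under this convention, 85.5% means the fine-tuned model recovers most ground-truth answer tokens, but recall alone does not penalize extra, possibly unsafe, content in the answer.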