Towards Responsible Multimodal Medical Reasoning via Context-Aligned Vision-Language Models

arXiv cs.CV / 4/13/2026


Key Points

  • The paper argues that medical vision-language models can give fluent but poorly grounded diagnostic conclusions when they over-rely on a dominant modality.
  • It proposes a context-aligned multimodal reasoning framework that augments a frozen VLM with structured contextual signals (e.g., radiomic statistics, explainability activations, and vocabulary-grounded semantic cues) and verifies agreement across heterogeneous clinical evidence before answering.
  • The method shifts outputs from free-form text to structured reports that include supporting evidence, calibrated uncertainty, limitations, and safety notes (a schema sketch follows this list).
  • Experiments on chest X-ray datasets show improved discriminative performance (AUC 0.918→0.925), reduced hallucinated keywords (1.14→0.25), and shorter reasoning explanations (19.4→15.3 words) without increasing overconfidence.
  • Cross-dataset results (e.g., CheXpert) indicate that the informativeness of each modality affects the model’s reasoning behavior, highlighting the importance of context alignment for trustworthy medical multimodal reasoning.
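
The paper describes the structured report only at a high level, so the following is a minimal sketch of what such an output schema might look like. All field names here are hypothetical illustrations, not the authors' actual format.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical schema for the structured report described above;
# field names are illustrative, not taken from the paper.
@dataclass
class StructuredReport:
    conclusion: str                  # diagnostic conclusion text
    supporting_evidence: List[str]   # evidence items that agree across modalities
    confidence: float                # calibrated probability in [0, 1]
    limitations: List[str] = field(default_factory=list)   # e.g. missing lateral view
    safety_notes: List[str] = field(default_factory=list)  # e.g. "requires radiologist review"

report = StructuredReport(
    conclusion="findings consistent with cardiomegaly",
    supporting_evidence=[
        "enlarged cardiac silhouette (image)",
        "radiomic cardiothoracic ratio above threshold",
    ],
    confidence=0.68,
    limitations=["single frontal view only"],
    safety_notes=["not a substitute for radiologist review"],
)
```

Replacing free-form text with a typed structure like this is what makes the downstream checks (evidence counting, confidence calibration) mechanically auditable.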

Abstract

Medical vision-language models (VLMs) show strong performance on radiology tasks but often produce fluent yet weakly grounded conclusions due to over-reliance on a dominant modality. We introduce a context-aligned reasoning framework that enforces agreement across heterogeneous clinical evidence before generating diagnostic conclusions. The proposed approach augments a frozen VLM with structured contextual signals derived from radiomic statistics, explainability activations, and vocabulary-grounded semantic cues. Instead of producing free-form responses, the model generates structured outputs containing supporting evidence, uncertainty estimates, limitations, and safety notes. We observe that auxiliary signals alone provide limited benefit; performance gains emerge only when these signals are integrated through contextual verification. Experiments on chest X-ray datasets demonstrate that context alignment improves discriminative performance (AUC 0.918 to 0.925) while maintaining calibrated uncertainty. The framework also substantially reduces hallucinated keywords (1.14 to 0.25) and produces more concise reasoning explanations (19.4 to 15.3 words) without increasing model confidence (0.70 to 0.68). Cross-dataset evaluation on CheXpert further reveals that modality informativeness significantly influences reasoning behavior. These results suggest that enforcing multi-evidence agreement improves both reliability and trustworthiness in medical multimodal reasoning, while preserving the underlying model architecture.
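
The summary gives no pseudocode for the contextual-verification step, but it can be read as a gating rule: commit to a conclusion only when enough independent evidence channels agree. The sketch below is an assumption-laden illustration of that idea; the signal names, thresholds, and majority-agreement quorum are all placeholders, not the authors' method.

```python
from typing import Dict

def verify_agreement(signals: Dict[str, float],
                     support_threshold: float = 0.5,
                     min_agreeing: int = 2) -> bool:
    """Return True only if enough independent evidence channels support
    the candidate finding. Threshold and quorum are illustrative."""
    agreeing = sum(score >= support_threshold for score in signals.values())
    return agreeing >= min_agreeing

# Hypothetical per-channel support scores in [0, 1]. In the paper these
# come from radiomic statistics, explainability activations, and
# vocabulary-grounded cues; here they are placeholder numbers.
signals = {
    "vlm_image_text": 0.81,   # frozen VLM's own image-text score
    "radiomic": 0.74,         # e.g. a cardiothoracic-ratio statistic
    "explainability": 0.42,   # e.g. activation-map overlap with the finding
    "semantic_vocab": 0.66,   # vocabulary-grounded cue match
}

if verify_agreement(signals, min_agreeing=3):
    print("emit diagnostic conclusion with supporting evidence")
else:
    print("abstain or flag low cross-evidence agreement in the report")
```

This framing is consistent with the paper's observation that the auxiliary signals alone help little: each channel is weak in isolation, and the gain comes from requiring them to corroborate one another.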
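
The reported drop in hallucinated keywords (1.14 to 0.25 per report) implies a metric that counts generated medical terms unsupported by the evidence. The exact definition is not given in this summary, so the sketch below is a plausible stand-in built on an assumed fixed vocabulary and a set of evidence-grounded terms.

```python
def count_hallucinated_keywords(generated_text: str,
                                medical_vocab: set[str],
                                grounded_terms: set[str]) -> int:
    """Count medical keywords in the output that no evidence channel
    supports. Both term sets are assumptions; the paper's metric may differ."""
    tokens = {tok.strip(".,;:").lower() for tok in generated_text.split()}
    return len((tokens & medical_vocab) - grounded_terms)

vocab = {"cardiomegaly", "effusion", "pneumothorax", "consolidation"}
grounded = {"cardiomegaly"}  # terms supported by image/radiomic evidence
n = count_hallucinated_keywords(
    "Findings suggest cardiomegaly with possible effusion.", vocab, grounded)
print(n)  # -> 1 ("effusion" is unsupported by the evidence)
```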