Less Is More? Selective Visual Attention to High-Importance Regions for Multimodal Radiology Summarization

arXiv cs.CL / 4/1/2026



Key Points

The paper argues that multimodal radiology summarization models often struggle with visual noise and fail to meaningfully improve over strong text-only baselines on the FINDINGS→IMPRESSION transformation.
It challenges the assumption that more visual input is always better: controlled ablations on MIMIC-CXR show that selectively attending to pathology-relevant patches outperforms using full images.
The authors introduce ViTAS (Visual-Text Attention Summarizer), a multi-stage pipeline that uses ensemble-guided MedSAM2 lung segmentation, bidirectional cross-attention for multi-view fusion, Shapley-guided adaptive patch clustering, and hierarchical visual tokenization into a ViT.
ViTAS reports state-of-the-art performance on the benchmark, with improvements in overlap metrics (29.25% BLEU-4, 69.83% ROUGE-L) and better factual alignment in qualitative assessment.
Human evaluation further supports the approach, with the model achieving the highest expert-rated scores, reinforcing that “less but more relevant” visual input can be superior for this task.
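
For readers unfamiliar with the overlap metrics cited in the key points, ROUGE-L scores a generated impression by the longest common subsequence (LCS) of tokens it shares with the reference. The following is a minimal sketch of the F1 variant, assuming lowercase whitespace tokenization; the function names are illustrative, and published implementations differ in tokenization, stemming, and how precision and recall are weighted, so this is not the paper's evaluation script.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between two strings (assumed whitespace tokenization)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens on the LCS
    recall = lcs / len(ref)      # share of reference tokens on the LCS
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    score = rouge_l_f1("no acute cardiopulmonary process", "no acute process")
    print(round(score, 4))  # precision 3/3, recall 3/4
```

In this toy pair the candidate drops one reference token, so precision is 1.0 and recall is 0.75. A score such as the reported 69.83% is an average of per-report values like this over a test set.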

Abstract

Automated radiology report summarization aims to distill verbose findings into concise clinical impressions, but existing multimodal models often struggle with visual noise and fail to meaningfully improve over strong text-only baselines in the FINDINGS → IMPRESSION transformation. We challenge two prevailing assumptions: (1) that more visual input is always better, and (2) that multimodal models add limited value when findings already contain rich image-derived detail. Through controlled ablations on the MIMIC-CXR benchmark, we show that selectively focusing on pathology-relevant visual patches rather than full images yields substantially better performance. We introduce ViTAS, Visual-Text Attention Summarizer, a multi-stage pipeline that combines ensemble-guided MedSAM2 lung segmentation, bidirectional cross-attention for multi-view fusion, Shapley-guided adaptive patch clustering, and hierarchical visual tokenization feeding a ViT. ViTAS achieves SOTA results with 29.25% BLEU-4 and 69.83% ROUGE-L, improved factual alignment in qualitative analysis, and the highest expert-rated human evaluation scores. Our findings demonstrate that less but more relevant visual input is not only sufficient but superior for multimodal radiology summarization.
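
To make the idea of Shapley-guided patch selection concrete, here is a minimal sketch of one generic way to score image patches by their Shapley value and keep only the highest-scoring ones. It is not the ViTAS implementation: the abstract does not specify the value function or the clustering step, so `toy_value`, `TOY_EVIDENCE`, and every function name below are hypothetical stand-ins, and the estimator is plain Monte Carlo permutation sampling.

```python
import random


def shapley_patch_scores(n_patches, value_fn, n_permutations=200, seed=0):
    """Monte Carlo estimate of each patch's Shapley value under value_fn.

    value_fn maps a set of patch indices to a scalar utility. In a real
    system this might be a model's confidence given only those patches
    (an assumption; the source does not define it).
    """
    rng = random.Random(seed)
    scores = [0.0] * n_patches
    for _ in range(n_permutations):
        order = list(range(n_patches))
        rng.shuffle(order)
        coalition = set()
        prev_value = value_fn(coalition)
        for patch in order:
            coalition.add(patch)
            value = value_fn(coalition)
            # Marginal contribution of this patch in this ordering.
            scores[patch] += value - prev_value
            prev_value = value
    return [s / n_permutations for s in scores]


def select_top_patches(scores, k):
    """Indices of the k highest-scoring patches, in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical per-patch evidence: patches 2 and 4 carry most of the signal.
TOY_EVIDENCE = [0.0, 0.1, 0.9, 0.0, 0.7, 0.05]


def toy_value(coalition):
    """Additive toy utility: total evidence of the patches in the coalition."""
    return sum(TOY_EVIDENCE[i] for i in coalition)


if __name__ == "__main__":
    scores = shapley_patch_scores(len(TOY_EVIDENCE), toy_value)
    print(select_top_patches(scores, k=2))
```

Because the toy utility is additive, each patch's estimated score equals its own evidence value, so the two patches kept are the ones with evidence 0.9 and 0.7. With a real, non-additive utility the scores would also reflect interactions between patches, and the cost grows as one utility evaluation per patch per sampled permutation, which is one plausible reason to cluster patches before scoring them.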