Semantic Context-aware mOdality fUsion Transformer (SCOUT): A Context-Aware Multimodal Transformer for Concept-Grounded Pathology Report Generation

arXiv cs.CV / 5/5/2026


Key Points

  • The paper introduces SCOUT, a context-aware multimodal Transformer designed to generate clinically grounded pathology reports from whole-slide images (WSIs), rather than producing text that is fluent but not grounded in diagnostic concepts.
  • SCOUT progressively conditions visual representations using both global slide context and explicit diagnostic concepts, integrating local histological patterns, slide-level architecture, and expert-curated semantic descriptors in a unified learning framework.
  • The method improves interpretability and clinical coherence by dynamically refining image features during encoding and by applying depth-aware contextual modulation together with adaptive multimodal fusion during text generation (see the sketch after this list).
  • Experiments using CONCH1.5 features show SCOUT outperforming prior approaches (WSI-Caption, HistGen, and BiGen) across multiple benchmarks, achieving the top BLEU-1 to BLEU-4 and METEOR scores on all datasets and the best ROUGE-L on TCGA-BRCA and MICCAI REG.
  • On TCGA-BRCA, SCOUT reaches 0.436/0.303/0.202/0.156 BLEU-1/2/3/4 and 0.204 METEOR; on REG 2025 it scores 0.865/0.834/0.805/0.780 with 0.568 METEOR, supporting the effectiveness of progressive contextual conditioning for concept-grounded pathology report generation.
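
To make the conditioning idea concrete, below is a minimal PyTorch sketch of what depth-aware contextual conditioning could look like: patch features are modulated FiLM-style by a global slide summary and pooled concept embeddings, with a learned per-depth gate controlling how strongly each encoder layer is conditioned. The module name, shapes, and design choices here are illustrative assumptions based on the abstract, not the paper's released implementation.

```python
import torch
import torch.nn as nn

class DepthAwareConditioning(nn.Module):
    """Illustrative sketch (assumption, not the paper's code): modulate
    patch features with slide-level context and concept embeddings,
    scaled by a learned per-depth gate."""

    def __init__(self, dim: int, num_layers: int):
        super().__init__()
        # One FiLM-style (scale, shift) projection per encoder depth.
        self.film = nn.ModuleList(
            nn.Linear(2 * dim, 2 * dim) for _ in range(num_layers)
        )
        # Learned gate controlling how strongly each depth is conditioned.
        self.depth_gate = nn.Parameter(torch.zeros(num_layers))

    def forward(self, patches, slide_ctx, concepts, depth):
        # patches:   (B, N, dim) local histology patch features
        # slide_ctx: (B, dim)    global whole-slide summary vector
        # concepts:  (B, K, dim) expert-curated concept embeddings
        ctx = torch.cat([slide_ctx, concepts.mean(dim=1)], dim=-1)  # (B, 2*dim)
        scale, shift = self.film[depth](ctx).chunk(2, dim=-1)       # (B, dim) each
        gate = torch.sigmoid(self.depth_gate[depth])
        # FiLM modulation, blended with the identity path by the depth gate.
        modulated = patches * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return (1 - gate) * patches + gate * modulated

# Quick shape check (hypothetical sizes).
mod = DepthAwareConditioning(dim=512, num_layers=6)
out = mod(torch.randn(2, 196, 512), torch.randn(2, 512),
          torch.randn(2, 8, 512), depth=0)
assert out.shape == (2, 196, 512)
```

A per-depth gate of this kind would let shallow layers stay close to raw histological features while deeper layers absorb more slide- and concept-level context, which matches the progressive conditioning the paper describes.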

Abstract

Whole-slide images (WSIs) present a fundamental challenge for computational pathology due to their extreme resolution, multi-scale heterogeneity, and the requirement for clinically reliable interpretation. Although recent pathology foundation models have enabled fluent report generation, they often lack clinical grounding, failing to accurately represent key diagnostic concepts and relationships observed by pathologists. This limitation arises from the difficulty of integrating heterogeneous visual evidence spanning fine-grained cellular patterns, slide-level tissue architecture, and high-level diagnostic concepts, while maintaining interpretability and clinical coherence. Here we present SCOUT: Semantic Context-aware mOdality fUsion Transformer, a context-aware, concept-grounded multimodal framework for pathology report generation that enables progressive conditioning of image representations by global slide information and explicit diagnostic concepts. The method integrates local histological patterns, whole-slide context, and expert-curated semantic descriptors within a unified learning paradigm, allowing visual features to be dynamically refined throughout the encoding process. By combining depth-aware contextual modulation with adaptive multimodal fusion during text generation, the framework produces clinically coherent reports while preserving complementarity across representational scales. Using CONCH1.5 features, we evaluate SCOUT against WSI-Caption, HistGen, and BiGen on TCGA-BRCA, MICCAI REG, and HistAI. SCOUT achieves the best BLEU-1 to BLEU-4 and METEOR scores on all datasets, plus the best ROUGE-L on TCGA-BRCA and MICCAI REG. On TCGA-BRCA, it reaches 0.436/0.303/0.202/0.156 BLEU-1/2/3/4 and 0.204 METEOR; on REG 2025, it achieves 0.865/0.834/0.805/0.780 and 0.568 METEOR. These results support progressive contextual conditioning for grounded pathology report generation.
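
To complement the encoder sketch above, the adaptive multimodal fusion step during text generation could plausibly be implemented as two cross-attention streams, one over the conditioned visual features and one over concept embeddings, mixed per token by a sigmoid gate. As before, this is a hedged illustration: `AdaptiveFusion`, its shapes, and the gating form are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Illustrative sketch (assumption, not the paper's code): the decoder
    state attends separately to visual and concept memories, and a
    token-wise gate mixes the two streams."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_con = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, hidden, visual_mem, concept_mem):
        # hidden:      (B, T, dim) decoder token states
        # visual_mem:  (B, N, dim) conditioned WSI patch features
        # concept_mem: (B, K, dim) diagnostic concept embeddings
        v, _ = self.attn_vis(hidden, visual_mem, visual_mem)
        c, _ = self.attn_con(hidden, concept_mem, concept_mem)
        g = self.gate(torch.cat([v, c], dim=-1))  # (B, T, dim) per-token gate
        return hidden + g * v + (1 - g) * c       # residual fused update
```

A token-wise gate of this kind would let the decoder lean on concept memory when emitting diagnostic terminology and on visual memory when describing morphology, which is one plausible reading of the paper's "adaptive multimodal fusion".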