AI Navigate

Prototype-Based Knowledge Guidance for Fine-Grained Structured Radiology Reporting

arXiv cs.AI · March 13, 2026


Key Points

  • ProtoSR integrates free-text derived knowledge into structured radiology reporting by using a multimodal knowledge base of visual prototypes aligned with the reporting template.
  • The approach automatically extracts knowledge from 80k+ MIMIC-CXR studies using an instruction-tuned LLM to populate the knowledge base.
  • ProtoSR retrieves relevant prototypes for a given image-question pair and augments predictions with a prototype-conditioned residual, acting as a data-driven second opinion.
  • On the Rad-ReStruct benchmark, ProtoSR achieves state-of-the-art results, with the largest gains for detailed attribute questions, demonstrating the value of leveraging unstructured text signals for fine-grained image understanding.
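The retrieval-and-residual mechanism in the points above can be sketched in a few lines. This is a minimal illustration with NumPy, not the paper's implementation: the embedding dimensions, the top-k retrieval, the similarity-weighted residual, and the scaling factor `alpha` are all assumptions made for the sake of a runnable example.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    """Normalize vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Hypothetical knowledge base: one visual prototype per answer option
# of the structured reporting template (random stand-ins here).
num_options, dim = 8, 64
prototypes = l2_normalize(rng.normal(size=(num_options, dim)))

def prototype_residual(query, base_logits, prototypes, k=3, alpha=0.5):
    """Retrieve the k prototypes most similar to the image-question embedding
    and add a similarity-weighted residual to the base prediction,
    acting as a data-driven 'second opinion' on those options."""
    sims = prototypes @ l2_normalize(query)  # cosine similarity per option
    topk = np.argsort(sims)[-k:]             # indices of the k nearest prototypes
    residual = np.zeros_like(base_logits)
    residual[topk] = sims[topk]              # only retrieved options are adjusted
    return base_logits + alpha * residual

query = rng.normal(size=dim)                 # fused image-question embedding
base_logits = rng.normal(size=num_options)   # model's initial answer scores
corrected = prototype_residual(query, base_logits, prototypes)
```

Because the residual touches only the retrieved entries, the correction is selective: answer options without a sufficiently similar prototype keep their original scores.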

Abstract

Structured radiology reporting promises faster, more consistent communication than free text, but automation remains difficult as models must make many fine-grained, discrete decisions about rare findings and attributes from limited structured supervision. In contrast, free-text reports are produced at scale in routine care and implicitly encode fine-grained, image-linked information through detailed descriptions. To leverage this unstructured knowledge, we propose ProtoSR, an approach for injecting free-text information into structured report population. First, we introduce an automatic extraction pipeline that uses an instruction-tuned LLM to mine 80k+ MIMIC-CXR studies and build a multimodal knowledge base aligned with a structured reporting template, representing each answer option with a visual prototype. Using this knowledge base, ProtoSR is trained to retrieve prototypes relevant for the current image-question pair and augment the model predictions through a prototype-conditioned residual, providing a data-driven second opinion that selectively corrects predictions. On the Rad-ReStruct benchmark, ProtoSR achieves state-of-the-art results, with the largest improvements on detailed attribute questions, demonstrating the value of integrating free-text derived signal for fine-grained image understanding.
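The knowledge-base construction the abstract describes, mining free-text studies and representing each template answer option with a visual prototype, can be sketched as mean-pooling over mined image-option pairs. Everything below is a hypothetical stand-in: the option labels, the embedding dimension, and the mean-pooling aggregation are assumptions for illustration, not the paper's pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical mined annotations: (image_embedding, answer_option) pairs,
# as an LLM-based extraction pipeline might produce from free-text reports.
options = ["opacity", "effusion", "normal"]
dim = 32
samples = [(rng.normal(size=dim), options[i % 3]) for i in range(30)]

def build_prototypes(samples, options):
    """Build one visual prototype per answer option by averaging the
    embeddings of all images that the mined annotations link to it."""
    protos = {}
    for opt in options:
        embs = [emb for emb, o in samples if o == opt]
        protos[opt] = np.mean(embs, axis=0)
    return protos

kb = build_prototypes(samples, options)  # option label -> prototype vector
```

Aligning the prototypes with the reporting template in this way is what lets a single retrieval step serve every fine-grained question the template asks.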