Toward Clinically Acceptable Chest X-ray Report Generation: A Qualitative Retrospective Pilot Study of CXRMate-2

arXiv cs.CV / 4/22/2026


Key Points

  • The paper introduces CXRMate-2, a chest X-ray radiology report generation model that uses structured multimodal conditioning and reinforcement learning with a composite reward for semantic alignment with radiologist reports (a minimal sketch of this reward setup follows this list).
  • Across multiple datasets (MIMIC-CXR, CheXpert Plus, ReXgradient), CXRMate-2 shows statistically significant gains over strong baselines, including improvements of 11.2% in GREEN and 24.4% in RadGraph-XL on MIMIC-CXR relative to MedGemma 1.5 (4B).
  • In a blinded, randomized qualitative retrospective comparison using 120 MIMIC-CXR test studies, generated reports were deemed acceptable (preferred over, or rated equal to, the radiologist report) in 45% of ratings, with no statistically significant preference difference for seven of the eight analyzed findings.
  • Preference for radiologist reports was driven mainly by their higher recall, whereas generated reports were often favored for readability, highlighting both strengths and remaining gaps for clinical use.
  • The authors conclude that improved recall and detection of subtle findings may make CXR RRG suitable for prospective evaluation in assistive roles within radiologist-led workflows.
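
A minimal, hypothetical sketch of the reinforcement-learning reward setup described above: a composite reward scored against the radiologist reference, combined with a self-critical (greedy-decoding) baseline. The component scorers, weights, and example reports below are illustrative assumptions; this summary does not specify CXRMate-2's actual reward composition.

```python
# Hypothetical sketch: composite reward + self-critical baseline for
# RL-based report generation. Component scorers and weights are
# placeholders, NOT CXRMate-2's actual reward.

def semantic_alignment_reward(generated: str, reference: str) -> float:
    """Placeholder semantic scorer: token-set F1 against the reference.
    A real system would use a learned or structured metric instead."""
    gen, ref = set(generated.lower().split()), set(reference.lower().split())
    overlap = len(gen & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(gen), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def composite_reward(generated: str, reference: str,
                     weights: tuple[float, float] = (0.7, 0.3)) -> float:
    """Weighted sum of reward components (both placeholders here)."""
    w_sem, w_len = weights
    # Crude length term to discourage degenerate, overly short outputs.
    length_ratio = min(len(generated.split()) / max(len(reference.split()), 1), 1.0)
    return w_sem * semantic_alignment_reward(generated, reference) + w_len * length_ratio

# Self-critical baseline: advantage = reward(sampled) - reward(greedy).
reference = "mild cardiomegaly without focal consolidation or effusion"
sampled = "mild cardiomegaly with no focal consolidation"
greedy = "no acute cardiopulmonary abnormality"
advantage = composite_reward(sampled, reference) - composite_reward(greedy, reference)
print(f"advantage = {advantage:+.3f}")
```

In a full training loop, this advantage would scale the log-likelihood gradient of the sampled report; the paper's semantic-alignment reward would take the place of the token-overlap placeholder.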

Abstract

Chest X-ray (CXR) radiology report generation (RRG) models have shown rapid progress, yet their clinical utility remains uncertain due to limited evaluation by radiologists. We present CXRMate-2, a state-of-the-art CXR RRG model that integrates structured multimodal conditioning and reinforcement learning with a composite reward for semantic alignment with radiologist reports. Across the MIMIC-CXR, CheXpert Plus, and ReXgradient datasets, CXRMate-2 achieves statistically significant improvements over strong benchmarks, including gains of 11.2% and 24.4% in GREEN and RadGraph-XL, respectively, on MIMIC-CXR relative to MedGemma 1.5 (4B). To directly compare CXRMate-2 against radiologist reporting, we conduct a blinded, randomised qualitative retrospective evaluation. Three consultant radiologists compare generated and radiologist reports across 120 studies from the MIMIC-CXR test set. Generated reports were deemed acceptable (defined as preferred or rated equally to radiologist reports) in 45% of ratings, with no statistically significant difference in preference rates between radiologist reports and acceptable generated reports for seven of the eight analysed findings. Preference for radiologist reports was driven primarily by higher recall, while generated reports were often preferred for readability. Together, these results suggest a credible pathway to clinically acceptable CXR RRG. Improvements in recall, alongside better detection of subtle findings (e.g., pulmonary congestion), are likely sufficient to achieve non-inferiority to radiologist reporting. With these targeted advances, CXR RRG systems may be ready for prospective evaluation in assistive roles within radiologist-led workflows.
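
The abstract reports per-finding preference comparisons, but this summary does not state the exact statistical procedure used. Below is a hedged sketch of one conventional choice, a two-sided binomial (sign-style) test on non-tied ratings per finding, using scipy.stats.binomtest; it also illustrates the paper's "acceptable" definition (preferred or rated equal). All counts are invented for illustration and are not the study's data.

```python
# Hedged sketch of a per-finding preference comparison. The exact test
# used in the study is not stated in this summary; a two-sided binomial
# (sign-style) test on non-tied ratings is one conventional choice.
# All counts below are invented for illustration, NOT the study's data.
from scipy.stats import binomtest

# finding -> (prefer radiologist, prefer generated, rated equal)
ratings = {
    "cardiomegaly": (18, 12, 10),
    "pleural_effusion": (22, 8, 10),
}

for finding, (rad, gen, tie) in ratings.items():
    n = rad + gen                       # ties excluded, as in a sign test
    result = binomtest(rad, n, p=0.5)   # H0: neither report type preferred
    acceptable = (gen + tie) / (rad + gen + tie)  # "preferred or rated equal"
    print(f"{finding}: p = {result.pvalue:.3f}, "
          f"generated acceptable in {acceptable:.0%} of ratings")
```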