Ran Score: An LLM-based Evaluation Score for Radiology Report Generation

arXiv cs.AI / 3/25/2026


Key Points

  • The paper introduces Ran Score, an LLM-based, finding-level evaluation metric for radiology report generation that targets challenges like low-prevalence abnormality recognition and clinically important language (negation/ambiguity).
  • It proposes a clinician-guided framework that combines human expertise with large language model prompting to perform multi-label finding extraction from free-text chest X-ray reports.
  • Using three non-overlapping MIMIC-CXR-EN cohorts plus an independent ChestX-CN validation cohort, the authors optimize prompts and derive radiologist-based reference labels to assess report generation models.
  • The optimized approach increases the macro-averaged score from 0.753 to 0.956 on the MIMIC-CXR-EN development cohort and outperforms the CheXbert benchmark by 15.7 percentage points on comparable labels.
  • Results show robust generalization to ChestX-CN and suggest Ran Score can improve fidelity evaluation, especially for detecting low-prevalence abnormalities.

Abstract

Chest X-ray report generation and automated evaluation are limited by poor recognition of low-prevalence abnormalities and inadequate handling of clinically important language, including negation and ambiguity. We develop a clinician-guided framework combining human expertise and large language models for multi-label finding extraction from free-text chest X-ray reports and use it to define Ran Score, a finding-level metric for report evaluation. Using three non-overlapping MIMIC-CXR-EN cohorts from a public chest X-ray dataset and an independent ChestX-CN validation cohort, we optimize prompts, establish radiologist-derived reference labels and evaluate report generation models. The optimized framework improves the macro-averaged score from 0.753 to 0.956 on the MIMIC-CXR-EN development cohort, exceeds the CheXbert benchmark by 15.7 percentage points on directly comparable labels, and shows robust generalization on the ChestX-CN validation cohort. Here we show that clinician-guided prompt optimization improves agreement with a radiologist-derived reference standard and that Ran Score enables finding-level evaluation of report fidelity, particularly for low-prevalence abnormalities.
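The paper reports a macro-averaged score per finding but does not spell out the underlying formula here; a common choice for multi-label finding extraction is per-finding F1 averaged with equal weight, which is exactly what makes low-prevalence abnormalities count as much as common ones. The sketch below illustrates that macro-averaging idea under that assumption; the finding names and counts are invented for illustration and are not from the paper.

```python
from statistics import mean

def f1(tp, fp, fn):
    """F1 for one finding label; 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(per_finding_counts):
    """Average F1 across findings, weighting each finding equally,
    so a rare abnormality contributes as much as a frequent one."""
    return mean(f1(*c) for c in per_finding_counts.values())

# Illustrative (TP, FP, FN) counts per finding -- hypothetical data,
# not the paper's cohorts or labels.
counts = {
    "pneumothorax": (8, 2, 1),       # low-prevalence finding
    "cardiomegaly": (90, 5, 10),
    "pleural_effusion": (40, 4, 6),
}
print(round(macro_f1(counts), 3))  # -> 0.885
```

Note the contrast with micro-averaging, which pools counts across findings and so is dominated by common labels; macro-averaging is the variant that surfaces failures on rare findings.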