Ran Score: An LLM-based Evaluation Score for Radiology Report Generation

arXiv cs.AI / 3/25/2026


Key Points

  • The paper introduces Ran Score, an LLM-based, finding-level evaluation metric for radiology report generation that targets challenges like low-prevalence abnormality recognition and clinically important language (negation/ambiguity).
  • It proposes a clinician-guided framework that combines human expertise with large language model prompting to perform multi-label finding extraction from free-text chest X-ray reports.
  • Using three non-overlapping MIMIC-CXR-EN cohorts plus an independent ChestX-CN validation cohort, the authors optimize prompts and derive radiologist-based reference labels to assess report generation models.
  • The optimized approach increases the macro-averaged score from 0.753 to 0.956 on the MIMIC-CXR-EN development cohort and outperforms the CheXbert benchmark by 15.7 percentage points on comparable labels.
  • Results show robust generalization to ChestX-CN and suggest Ran Score can improve fidelity evaluation, especially for detecting low-prevalence abnormalities.

Abstract

Chest X-ray report generation and automated evaluation are limited by poor recognition of low-prevalence abnormalities and inadequate handling of clinically important language, including negation and ambiguity. We develop a clinician-guided framework combining human expertise and large language models for multi-label finding extraction from free-text chest X-ray reports and use it to define Ran Score, a finding-level metric for report evaluation. Using three non-overlapping MIMIC-CXR-EN cohorts from a public chest X-ray dataset and an independent ChestX-CN validation cohort, we optimize prompts, establish radiologist-derived reference labels and evaluate report generation models. The optimized framework improves the macro-averaged score from 0.753 to 0.956 on the MIMIC-CXR-EN development cohort, exceeds the CheXbert benchmark by 15.7 percentage points on directly comparable labels, and shows robust generalization on the ChestX-CN validation cohort. Here we show that clinician-guided prompt optimization improves agreement with a radiologist-derived reference standard and that Ran Score enables finding-level evaluation of report fidelity, particularly for low-prevalence abnormalities.
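The paper reports a macro-averaged score per finding but does not spell out the underlying formula here; a common choice for multi-label finding extraction is per-finding F1 averaged with equal weight, which is exactly what makes low-prevalence abnormalities count as much as common ones. The sketch below illustrates that macro-averaging idea under that assumption; the finding names and counts are invented for illustration and are not from the paper.

```python
from statistics import mean

def f1(tp, fp, fn):
    """F1 for one finding label; 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(per_finding_counts):
    """Average F1 across findings, weighting each finding equally,
    so a rare abnormality contributes as much as a frequent one."""
    return mean(f1(*c) for c in per_finding_counts.values())

# Illustrative (TP, FP, FN) counts per finding -- hypothetical data,
# not the paper's cohorts or labels.
counts = {
    "pneumothorax": (8, 2, 1),       # low-prevalence finding
    "cardiomegaly": (90, 5, 10),
    "pleural_effusion": (40, 4, 6),
}
print(round(macro_f1(counts), 3))  # -> 0.885
```

Note the contrast with micro-averaging, which pools counts across findings and so is dominated by common labels; macro-averaging is the variant that surfaces failures on rare findings.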