Enhancing Reinforcement Learning for Radiology Report Generation with Evidence-aware Rewards and Self-correcting Preference Learning

arXiv cs.LG / 4/16/2026

Opinion | Ideas & Deep Analysis | Models & Research

Key Points

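  • The paper proposes Evidence-aware Self-Correcting Reinforcement Learning (ESC-RL) for radiology report generation, targeting two gaps in current RL approaches: report-level rewards that give limited evidence-grounded guidance for clinical faithfulness, and the lack of an explicit self-improving mechanism for aligning with clinical preference.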
  • ESC-RL introduces GEAR (Group-wise Evidence-aware Alignment Reward), which provides group-wise feedback to reinforce true positives, recover false negatives, and suppress unsupported false positives.
  • It also adds SPL (Self-correcting Preference Learning), which automatically builds a disease-aware preference dataset from multiple noisy observations and uses an LLM to synthesize refined reports without human supervision.
  • Experiments on two public chest X-ray datasets show consistent gains and state-of-the-art performance; the authors present ESC-RL as promoting clinically faithful, disease-aligned rewards and supporting continual self-improvement during training.
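
To make the GEAR bullet concrete, here is a minimal sketch of a reward that credits true positives and penalizes false negatives and false positives, then normalizes rewards within a group. It is not the paper's formulation. The function names, the unit weights, the representation of a report as a set of finding labels, and the reading of "group" as a set of candidate reports sampled for the same image are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def finding_reward(predicted: set[str], reference: set[str],
                   tp_bonus: float = 1.0,
                   fn_penalty: float = 1.0,
                   fp_penalty: float = 1.0) -> float:
    """Toy evidence-style reward over finding labels (hypothetical weights)."""
    true_pos = len(predicted & reference)    # findings correctly reported
    false_neg = len(reference - predicted)   # findings the report missed
    false_pos = len(predicted - reference)   # findings with no support
    return tp_bonus * true_pos - fn_penalty * false_neg - fp_penalty * false_pos


def group_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within one group: subtract the mean, divide by the std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All candidates scored the same, so none is preferred over another.
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


# Hypothetical example: three candidate reports for one image.
reference = {"cardiomegaly", "pleural effusion"}
group = [
    {"cardiomegaly", "pleural effusion"},   # complete and supported
    {"cardiomegaly"},                       # misses one finding
    {"cardiomegaly", "pneumothorax"},       # misses one, adds an unsupported one
]
rewards = [finding_reward(candidate, reference) for candidate in group]
advantages = group_advantages(rewards)
```

In this toy setup the complete report receives the highest advantage and the report with an unsupported finding the lowest, which mirrors the reinforce/recover/suppress behavior described above only in spirit.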

Abstract

Recent reinforcement learning (RL) approaches have advanced radiology report generation (RRG), yet two core limitations persist: (1) report-level rewards offer limited evidence-grounded guidance for clinical faithfulness; and (2) current methods lack an explicit self-improving mechanism to align with clinical preference. We introduce clinically aligned Evidence-aware Self-Correcting Reinforcement Learning (ESC-RL), comprising two key components. First, a Group-wise Evidence-aware Alignment Reward (GEAR) delivers group-wise, evidence-aware feedback. GEAR reinforces consistent grounding for true positives, recovers missed findings for false negatives, and suppresses unsupported content for false positives. Second, a Self-correcting Preference Learning (SPL) strategy automatically constructs a reliable, disease-aware preference dataset from multiple noisy observations and leverages an LLM to synthesize refined reports without human supervision. ESC-RL promotes clinically faithful, disease-aligned reward and supports continual self-improvement during training. Extensive experiments on two public chest X-ray datasets demonstrate consistent gains and state-of-the-art performance.
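
The preference-construction idea behind SPL can be illustrated with a small sketch. It is an assumption-laden toy rather than the authors' pipeline: it treats each "noisy observation" as a set of disease labels, takes a majority vote to form a consensus, and ranks candidate reports by label-level F1 against that consensus to pick a chosen/rejected pair. All names and the voting threshold are hypothetical, and the LLM refinement step mentioned in the abstract is not modeled here.

```python
from collections import Counter


def consensus_findings(observations: list[set[str]], min_votes: int) -> set[str]:
    """Keep findings reported by at least `min_votes` noisy observations."""
    votes = Counter(finding for obs in observations for finding in obs)
    return {finding for finding, count in votes.items() if count >= min_votes}


def label_f1(predicted: set[str], reference: set[str]) -> float:
    """F1 between two label sets; two empty sets count as perfect agreement."""
    if not predicted and not reference:
        return 1.0
    true_pos = len(predicted & reference)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(reference)
    return 2 * precision * recall / (precision + recall)


def build_preference_pair(candidates: dict[str, set[str]],
                          consensus: set[str]) -> tuple[str, str]:
    """Return (chosen_id, rejected_id) ranked by agreement with the consensus."""
    ranked = sorted(candidates, key=lambda cid: label_f1(candidates[cid], consensus))
    return ranked[-1], ranked[0]


# Hypothetical example: three noisy label sets and two candidate reports.
observations = [{"edema", "effusion"}, {"edema"}, {"edema", "pneumothorax"}]
consensus = consensus_findings(observations, min_votes=2)
candidates = {"report_a": {"edema"}, "report_b": {"pneumothorax"}}
chosen, rejected = build_preference_pair(candidates, consensus)
```

Pairs of this kind are the usual input to preference-optimization objectives; how the paper filters for reliability and where the LLM-refined report enters the pair are details this sketch does not attempt to reproduce.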