CT-FineBench: A Diagnostic Fidelity Benchmark for Fine-Grained Evaluation of CT Report Generation

arXiv cs.AI / April 28, 2026


Key Points

  • CT report generation is difficult to evaluate because conventional metrics only provide coarse checks (e.g., lexical overlap), missing fine-grained diagnostic correctness needed for clinical use.
  • The paper introduces CT-FineBench, a QA-based benchmark built from CT-RATE and Merlin that focuses on fine-grained factual consistency across disease-oriented clinical attributes.
  • CT-FineBench construction extracts structured finding-specific attributes (such as location, size, and margin) and converts them into a QA dataset grounded in gold-standard reports.
  • In evaluation, a generated report is queried with this QA set and answers are scored, enabling more interpretable detection of specific clinical errors.
  • Experiments indicate CT-FineBench correlates more strongly with expert clinical assessment and is far more sensitive to fine-grained factual mistakes than prior metrics.
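The QA-based evaluation loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the QA schema, the answer extractor (the paper presumably uses a learned QA model rather than string matching), and the scoring rule are all assumptions.

```python
# Illustrative sketch of a CT-FineBench-style evaluation protocol.
# All names and data structures here are hypothetical.

def answer_from_report(report: str, question: dict) -> str:
    """Toy answer extractor: return the first candidate option that
    appears verbatim in the report (a real system would use a QA model)."""
    for option in question["options"]:
        if option.lower() in report.lower():
            return option
    return "not mentioned"

def score_report(report: str, qa_set: list[dict]) -> float:
    """Fraction of attribute-level questions answered correctly."""
    correct = sum(answer_from_report(report, q) == q["gold"] for q in qa_set)
    return correct / len(qa_set)

# A tiny QA set derived from a gold-standard report, probing
# finding-specific attributes such as location and size.
qa_set = [
    {"attribute": "location",
     "question": "Where is the nodule located?",
     "options": ["right upper lobe", "left lower lobe"],
     "gold": "right upper lobe"},
    {"attribute": "size",
     "question": "What is the nodule's size?",
     "options": ["8 mm", "15 mm"],
     "gold": "8 mm"},
]

generated = "A solid 8 mm nodule is seen in the right upper lobe."
print(score_report(generated, qa_set))  # → 1.0
```

Because each question targets one clinical attribute, a wrong answer pinpoints exactly which detail (location, size, margin, etc.) the generated report got wrong, which is what makes the metric interpretable.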

Abstract

The evaluation of generated reports remains a critical challenge in Computed Tomography (CT) report generation, due to the large volume of text, the diversity and complexity of findings, and the presence of fine-grained, disease-oriented attributes. Conventional evaluation metrics offer only coarse measures of lexical overlap or entity matching and fail to reflect the granular diagnostic accuracy required for clinical use. To address this gap, we propose CT-FineBench, a benchmark constructed from CT-RATE and Merlin to evaluate the fine-grained factual consistency of CT reports. Our benchmark is built through a meticulous, Question-Answering (QA) based process: first, we identify and structure key, finding-specific clinical attributes (e.g., location, size, and margin). Second, we systematically transform these attributes into a QA dataset, where questions probe for specific clinical details grounded in gold-standard reports. The evaluation protocol for CT-FineBench involves using this QA dataset to query a machine-generated report and scoring the correctness of the answers. This allows for a comprehensive, interpretable, and clinically relevant assessment, moving beyond superficial lexical overlap to pinpoint specific clinical errors. Experiments show that CT-FineBench correlates better with expert clinical assessment and is substantially more sensitive to fine-grained factual errors than prior metrics.