CT-FineBench: A Diagnostic Fidelity Benchmark for Fine-Grained Evaluation of CT Report Generation
arXiv cs.AI / 4/28/2026
Key Points
- CT report generation is difficult to evaluate because conventional metrics provide only coarse checks (e.g., lexical overlap) and miss the fine-grained diagnostic correctness needed for clinical use.
- The paper introduces CT-FineBench, a QA-based benchmark built from CT-RATE and Merlin that focuses on fine-grained factual consistency across disease-oriented clinical attributes.
- CT-FineBench construction extracts structured finding-specific attributes (such as location, size, and margin) and converts them into a QA dataset grounded in gold-standard reports.
- In evaluation, a generated report is queried with this QA set and answers are scored, enabling more interpretable detection of specific clinical errors.
- Experiments indicate CT-FineBench correlates more strongly with expert clinical assessment and is far more sensitive to fine-grained factual mistakes than prior metrics.
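
The build-then-query flow described in the key points can be illustrated with a toy sketch. This is not the paper's implementation: the `QAItem` type, the function names, the question template, the exact-match scoring rule, and the example nodule data are all assumptions made for illustration. In particular, the sketch assumes the generated report has already been parsed into structured attributes, whereas a real pipeline would need a model to answer each question from free text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QAItem:
    """One question about a single attribute of a single finding (hypothetical type)."""
    finding: str
    attribute: str
    question: str
    gold_answer: str


def build_qa_items(gold_findings):
    """Turn {finding: {attribute: value}} from a gold report into one question per attribute."""
    items = []
    for finding, attributes in gold_findings.items():
        for attribute, value in attributes.items():
            question = f"What is the {attribute} of the {finding}?"
            items.append(QAItem(finding, attribute, question, value))
    return items


def answer_from_report(item, report_findings):
    """Toy answerer: look the attribute up in pre-structured findings.

    Stands in for whatever component answers questions from the generated report.
    """
    return report_findings.get(item.finding, {}).get(item.attribute)


def score_report(items, report_findings):
    """Return (accuracy, errors), using case-insensitive exact match as an assumed rule."""
    correct = 0
    errors = []
    for item in items:
        predicted = answer_from_report(item, report_findings)
        if predicted is not None and predicted.strip().lower() == item.gold_answer.strip().lower():
            correct += 1
        else:
            errors.append((item.finding, item.attribute, item.gold_answer, predicted))
    accuracy = correct / len(items) if items else 0.0
    return accuracy, errors


# Made-up example: the generated report gets the nodule size wrong.
gold = {"pulmonary nodule": {"location": "right upper lobe", "size": "8 mm", "margin": "spiculated"}}
generated = {"pulmonary nodule": {"location": "right upper lobe", "size": "5 mm", "margin": "spiculated"}}

items = build_qa_items(gold)
accuracy, errors = score_report(items, generated)
print(f"accuracy={accuracy:.2f}")
for finding, attribute, expected, got in errors:
    print(f"mismatch: {finding} / {attribute}: expected {expected!r}, got {got!r}")
```

Because each question targets one attribute of one finding, a failure points to a specific clinical error (here, the size) rather than to a lower overall overlap score, which is the kind of interpretability the summary attributes to a QA-based evaluation.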