VERT: Reliable LLM Judges for Radiology Report Evaluation
arXiv cs.AI · April 7, 2026
Key Points
- The paper introduces VERT, an LLM-based metric for evaluating radiology reports, addressing uncertainty about how well prior LLM-judge approaches generalize across different imaging modalities and anatomies.
- It performs a comprehensive correlation study between expert radiologist ratings and LLM judge outputs, comparing RadFact, GREEN, FineRadScore, and VERT using open/closed-source models with varying sizes and reasoning capabilities.
- Experiments on the RadEval and RaTE-Eval datasets evaluate few-shot prompting, ensembling, and parameter-efficient fine-tuning (with RaTE-Eval as a focus) to determine effective judge configurations.
- Results indicate VERT improves correlation with radiologist judgments by up to 11.7% relative to GREEN, and that fine-tuning Qwen3 30B can achieve up to 25% gains with only 1,300 samples.
- The study also includes a systematic error analysis characterizing where LLM metrics align with or diverge from expert judgments, and reports that fine-tuning speeds up inference by up to 37.2×.
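The core evaluation the key points describe is a rank-correlation study between expert radiologist ratings and LLM-judge scores. The sketch below is illustrative only (it is not the paper's code, and all scores are made up): it computes Kendall's tau-a in pure Python to show how a stronger judge would yield higher rank agreement with experts than a weaker one.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall rank correlation (tau-a): (concordant - discordant) pairs
    over all pairs; tied pairs count as neither."""
    concordant = discordant = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        s = (xi - xj) * (yi - yj)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical per-report scores (for illustration only).
expert  = [4, 2, 5, 3, 1, 4, 5, 2]                   # radiologist ratings
judge_a = [3.8, 2.1, 4.6, 3.2, 1.4, 3.9, 4.8, 2.5]   # stronger judge metric
judge_b = [3.0, 2.9, 4.0, 3.5, 2.0, 3.1, 4.2, 3.0]   # weaker judge metric

print(f"judge_a tau={kendall_tau(expert, judge_a):.3f}")  # → judge_a tau=0.893
print(f"judge_b tau={kendall_tau(expert, judge_b):.3f}")  # → judge_b tau=0.714
```

In practice such studies also report Pearson or Spearman correlations and bootstrap confidence intervals, but the ranking comparison above captures the basic "which judge agrees with experts more" question.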