Can LLMs Score Medical Diagnoses and Clinical Reasoning as well as Expert Panels?
arXiv cs.LG / 4/17/2026
Key Points
- The study tests an "LLM jury" of three frontier AI models that scores 3,333 medical diagnoses from real hospital cases, aiming to replace costly, slow expert clinician panels as the adjudicator.
- The LLM jury evaluates not only the final diagnosis but also the differential diagnosis, clinical reasoning, and negative treatment risk, with results benchmarked against expert panels and a separate human re-scoring panel.
- The uncalibrated LLM jury scores systematically lower than the clinician panels, but it preserves ordinal agreement, closely matches expert-panel rankings, and has a lower probability of severe safety errors than the human re-scoring panel.
- The research also finds that combining LLM-jury scoring with AI-generated diagnoses can flag high-risk ward diagnoses for targeted expert review, making expert panels more efficient.
- Post-hoc calibration with isotonic regression substantially improves alignment between LLM-jury and expert-panel scores, and the models show no self-preference bias toward diagnoses generated by their own underlying model or same-vendor models.
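The calibration step in the last point can be sketched as follows. This is a minimal illustration, not the paper's code: the pool-adjacent-violators routine is the textbook isotonic-regression fit, all score values are invented, and paired (LLM score, expert score) data is assumed.

```python
def pav(values):
    """Pool Adjacent Violators: least-squares non-decreasing fit (uniform weights)."""
    stack = []  # blocks of [mean, count]
    for v in values:
        mean, count = float(v), 1
        # Merge backwards while the new block violates monotonicity
        while stack and stack[-1][0] > mean:
            pm, pc = stack.pop()
            mean = (pm * pc + mean * count) / (pc + count)
            count += pc
        stack.append([mean, count])
    out = []
    for mean, count in stack:
        out.extend([mean] * count)
    return out


def calibrate(llm_scores, expert_scores, new_score):
    """Map a raw LLM-jury score onto the expert scale via isotonic regression.

    llm_scores must be sorted ascending; expert_scores are the paired
    expert-panel scores. Predictions linearly interpolate between knots
    and clip outside the observed range.
    """
    fitted = pav(expert_scores)  # monotone fit: ordering of scores is preserved
    if new_score <= llm_scores[0]:
        return fitted[0]
    if new_score >= llm_scores[-1]:
        return fitted[-1]
    for i in range(1, len(llm_scores)):
        if new_score <= llm_scores[i]:
            x0, x1 = llm_scores[i - 1], llm_scores[i]
            y0, y1 = fitted[i - 1], fitted[i]
            return y0 + (y1 - y0) * (new_score - x0) / (x1 - x0)


# Toy paired data: the uncalibrated jury scores systematically lower
llm = [1.0, 2.0, 3.0, 4.0, 5.0]
expert = [2.0, 2.8, 3.6, 4.4, 5.0]
```

Because the fitted mapping is non-decreasing, calibration shifts the score scale without reordering cases, which is consistent with the finding that the uncalibrated jury already preserves ordinal agreement with the experts.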



