A Scoping Review of LLM-as-a-Judge in Healthcare and the MedJUDGE Framework

arXiv cs.AI / April 30, 2026


Key Points

  • The paper surveys how LLM-as-a-Judge (LaaJ) is being used in healthcare to evaluate clinical text and reduce reliance on costly expert review, while highlighting safety and bias risks.
  • A PRISMA-ScR scoping review of 49 studies (from 6 databases, Jan 2020–Jan 2026) finds that the field is mostly focused on evaluation/benchmarking, pointwise scoring (sketched in code after this list), and GPT-family “judge” models.
  • The review finds weak validation rigor: many studies use minimal or no human expert validators, rarely test for bias, and generally do not assess demographic fairness, temporal stability, or patient context.
  • The authors argue that gaps can compound—especially when judges and evaluated systems share training data or architectures—so agreement metrics may miss systematic, clinically significant errors.
  • To address these issues, the authors propose MedJUDGE, a risk-stratified three-pillar framework for validity, safety, and accountability to guide deployment-oriented evaluation of healthcare LaaJ systems.
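
For readers unfamiliar with the pattern, the pointwise setup that dominates the surveyed studies is easy to sketch: the judge LLM sees a single candidate output at a time and returns a rubric score, with no pairwise comparison or ranking. The sketch below is illustrative only and not taken from the paper; `call_judge_llm`, the rubric wording, and the 1–5 scale are assumptions standing in for whatever model API and rubric a given study actually used.

```python
# Minimal pointwise LaaJ sketch: the judge scores one candidate at a time
# against a rubric, the pattern used in 42/49 (85.7%) of the reviewed studies.
PROMPT_TEMPLATE = """You are evaluating a clinical summary for factual accuracy.
Rubric: 1 = clinically unsafe errors, 3 = minor omissions, 5 = fully faithful.

Source note:
{source}

Candidate summary:
{candidate}

Respond with a single integer from 1 to 5."""


def call_judge_llm(prompt: str) -> str:
    """Hypothetical placeholder: swap in a real chat-completion API call."""
    return "4"  # canned reply so the sketch runs standalone


def pointwise_score(source: str, candidate: str) -> int:
    """Score one candidate in isolation; no pairwise comparison or ranking."""
    reply = call_judge_llm(PROMPT_TEMPLATE.format(source=source, candidate=candidate))
    score = int(reply.strip())
    if not 1 <= score <= 5:
        raise ValueError(f"judge returned out-of-range score: {score}")
    return score


if __name__ == "__main__":
    print(pointwise_score(
        "Pt admitted with community-acquired pneumonia, started on ceftriaxone.",
        "Patient admitted with pneumonia and treated with antibiotics.",
    ))
```

In practice the placeholder would be a real chat-completion call; the reviewed studies differ mainly in the rubric, the scale, and which GPT-family model plays the judge.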

Abstract

As large language models (LLMs) increasingly generate and process clinical text, scalable evaluation has become critical. LLM-as-a-Judge (LaaJ), which uses LLMs to evaluate model outputs, offers a scalable alternative to costly expert review, but its healthcare adoption raises safety and bias concerns. We conducted a PRISMA-ScR scoping review of six databases (January 2020–January 2026), screening 11,727 studies and including 49. The landscape was dominated by evaluation and benchmarking applications (n=37, 75.5%), pointwise scoring (n=42, 85.7%), and GPT-family judges (n=36, 73.5%). Despite growing adoption, validation rigor was limited: among the 36 studies with human involvement, the median number of expert validators was 3, while the remaining 13 of the 49 studies (26.5%) used none. Risk-of-bias testing was absent in 36 studies (73.5%), only 1 (2.0%) examined demographic fairness, and none assessed temporal stability or patient context. Deployment remained limited, with 1 study (2.0%) reaching production and 4 (8.2%) reaching prototype stage. Importantly, these gaps may interact: when judges and evaluated systems share training data or architectures, they may inherit similar blind spots, and agreement metrics may fail to distinguish true validity from shared errors. Minimal human oversight, limited bias assessment, and model monoculture together represent a governance gap in which current validation may miss clinically significant errors. To address this, we propose MedJUDGE (Medical Judge Utility, De-biasing, Governance and Evaluation), a risk-stratified three-pillar framework organized around validity, safety, and accountability across clinical risk tiers, providing deployment-oriented evaluation guidance for healthcare LaaJ systems.
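
The shared-blind-spot concern can be made concrete with a toy simulation. Nothing below comes from the paper: the 20% blind-spot rate, the 95% base accuracy, and the binary safe/unsafe labels are invented purely to show how agreement between two correlated judges can stay high while agreement with ground truth degrades.

```python
# Toy simulation of the shared-blind-spot failure mode: two judges that
# inherit the same weakness (e.g., a common pretraining gap) agree with each
# other more strongly than either agrees with ground truth, so inter-judge
# agreement overstates validity. All rates here are illustrative assumptions.
import random


def cohens_kappa(a, b):
    """Cohen's kappa for two binary label sequences."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n        # observed agreement
    p1 = (sum(a) / n) * (sum(b) / n)                  # both say 1 by chance
    p0 = (1 - sum(a) / n) * (1 - sum(b) / n)          # both say 0 by chance
    pe = p1 + p0                                      # chance agreement
    return (po - pe) / (1 - pe)


random.seed(0)
truth = [random.randint(0, 1) for _ in range(10_000)]   # 1 = output is safe
# 20% of cases hit a blind spot that BOTH judges share:
blind = [random.random() < 0.2 for _ in truth]


def judge(t, b):
    if b:
        return 1  # shared blind spot: this judge always calls the case "safe"
    return t if random.random() < 0.95 else 1 - t     # otherwise 95% accurate


j1 = [judge(t, b) for t, b in zip(truth, blind)]
j2 = [judge(t, b) for t, b in zip(truth, blind)]

print("judge1 vs judge2 kappa:", round(cohens_kappa(j1, j2), 2))    # higher
print("judge1 vs truth  kappa:", round(cohens_kappa(j1, truth), 2))  # lower
```

Under these assumptions the inter-judge kappa exceeds either judge's kappa against ground truth by roughly 0.1, which is exactly the failure mode the authors warn about: validating a judge by its agreement with another correlated model (or with outputs of a system sharing its training data) can look reassuring while systematic, clinically significant errors go undetected.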