Using LLM-as-a-Judge/Jury to Advance Scalable, Clinically-Validated Safety Evaluations of Model Responses to Users Demonstrating Psychosis

arXiv cs.AI / 4/6/2026


Key Points

  • The paper argues that LLMs used for mental health support may pose elevated risks for people with psychosis, since models can reinforce delusions and hallucinations.
  • It proposes a clinically grounded evaluation approach by developing and validating seven clinician-informed safety criteria specifically targeting psychosis-related harms.
  • The authors create a human-consensus dataset and then test automated safety assessment using an LLM as a judge, as well as an ensemble majority-vote approach (“LLM-as-a-Jury”).
  • Results show strong alignment between LLM-as-a-Judge and human consensus (Cohen's kappa up to 0.75), with the best single judge slightly outperforming the jury method (0.75 vs. 0.74).
  • Overall, the findings suggest that LLM-as-a-Judge can enable scalable, clinically validated safety evaluations for mental-health LLM responses.
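The two evaluation schemes above can be sketched in a few lines: LLM-as-a-Jury reduces to a per-item majority vote over judge labels, and agreement with human consensus is measured with Cohen's kappa. This is an illustrative sketch, not the paper's code; the binary safe/unsafe labels and all label values below are hypothetical.

```python
# Sketch (not the paper's implementation): majority-vote aggregation
# ("LLM-as-a-Jury") and Cohen's kappa for judge-vs-human agreement,
# assuming each response gets a binary label: 1 = safe, 0 = unsafe.
from collections import Counter

def jury_vote(judge_labels):
    """Majority vote across judges for each response (odd jury avoids ties)."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*judge_labels)]

def cohens_kappa(a, b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n        # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[c] / n * cb[c] / n for c in ca | cb)  # agreement expected by chance
    return (po - pe) / (1 - pe)

# Hypothetical data: three LLM judges and a human consensus over 6 responses.
human  = [1, 0, 1, 1, 0, 1]
judges = [[1, 0, 1, 0, 0, 1],
          [1, 0, 1, 1, 1, 1],
          [1, 1, 1, 1, 0, 1]]
jury = jury_vote(judges)  # per-item majority: [1, 0, 1, 1, 0, 1]
```

With these toy labels the jury matches the human consensus exactly (kappa = 1.0), while an individual judge such as `judges[0]` agrees less; in the paper the pattern is reversed, with the best single judge edging out the jury.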

Abstract

General-purpose Large Language Models (LLMs) are becoming widely adopted for mental health support. Yet emerging evidence suggests there are significant risks associated with high-frequency use, particularly for individuals experiencing psychosis, as LLMs may reinforce delusions and hallucinations. Existing evaluations of LLMs in mental health contexts are limited by a lack of clinical validation and by assessment methods that do not scale. To address these issues, this research focuses on psychosis as a critical condition for LLM safety evaluation by (1) developing and validating seven clinician-informed safety criteria, (2) constructing a human-consensus dataset, and (3) testing automated assessment using a single LLM as an evaluator (LLM-as-a-Judge) or the majority vote of several LLM judges (LLM-as-a-Jury). Results indicate that LLM-as-a-Judge aligns closely with the human consensus (Cohen's κ: human × gemini = 0.75, human × qwen = 0.68, human × kimi = 0.56) and that the best single judge slightly outperforms LLM-as-a-Jury (human × jury = 0.74). Overall, these findings have promising implications for clinically grounded, scalable methods of LLM safety evaluation in mental health contexts.