Reliability Auditing for Downstream LLM tasks in Psychiatry: LLM-Generated Hospitalization Risk Scores

arXiv cs.AI / 4/27/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper examines whether LLMs used for psychiatric clinical reasoning produce reliable hospitalization risk scores, focusing on interpretive stability in an uncertain domain.
  • It introduces a reliability-auditing framework that tests two factors: how different prompt framings affect outputs and how adding medically insignificant inputs changes predicted risk.
  • Using synthetic patient profiles (n=50) with clinically relevant features plus up to 50 clinically insignificant features, evaluated across four prompt reframings, the study audits four LLMs: Gemini 2.5 Flash, LLaMa 3.3 70b, Claude Sonnet 4.6, and GPT-4o mini.
  • Across all models and prompt styles, adding medically insignificant variables significantly increases both the absolute mean predicted hospitalization risk and the variability of outputs, indicating reduced predictive stability.
  • Prompt variations and clinically insignificant features each drive instability independently, in model-dependent ways, underscoring the need for systematic evaluations of uncertainty and attribution stability before clinical deployment.

Abstract

Large language models (LLMs) are increasingly used in clinical reasoning and risk assessment, yet their interpretive reliability in critical and indeterminate domains such as psychiatry remains unclear. Prior work has identified algorithmic biases and prompt sensitivity in these systems, raising concerns about how contextual information may influence model outputs, but there is still no systematic way to assess these effects, especially in the psychiatric domain. We propose an approach for auditing the reliability of downstream LLM tasks by structuring evaluation around the impact of prompt design and the inclusion of medically insignificant inputs on predicted hospitalization risk scores, often the first clinical-decision-making task delegated to AI systems. In our audit, a cohort of synthetic patient profiles (n = 50) is generated, each consisting of 15 clinically relevant features and up to 50 clinically insignificant features, evaluated across four prompt reframings (neutral, logical, human impact, clinical judgment). We audit four LLMs (Gemini 2.5 Flash, LLaMa 3.3 70b, Claude Sonnet 4.6, GPT-4o mini). Our results show that including medically insignificant variables produced a statistically significant increase in both the absolute mean predicted hospitalization risk and output variability across all models and prompts, indicating reduced predictive stability as contextual noise increased. Clinically insignificant features contributed to instability across many model-prompt conditions, and prompt variations independently affected the trajectory of instability in a model-dependent manner. These findings quantify how sensitive LLM-based psychiatric risk assessments are to non-clinical information, highlighting the need for systematic evaluations of attributional stability and uncertainty behavior before clinical deployment.
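The audit loop described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's code: `score_risk` is a deterministic stub standing in for a real LLM call, and all helper names and the toy noise-drift behavior are assumptions made for the example.

```python
import random
import statistics

# The paper's four prompt reframings.
FRAMINGS = ["neutral", "logical", "human impact", "clinical judgment"]

def make_profile(rng, n_relevant=15, n_noise=0):
    """Synthetic patient profile: clinically relevant features plus
    medically insignificant (noise) features, as in the paper's setup."""
    profile = {f"relevant_{i}": rng.random() for i in range(n_relevant)}
    profile.update({f"noise_{i}": rng.random() for i in range(n_noise)})
    return profile

def score_risk(profile, framing, rng):
    """Stub for an LLM call returning a 0-100 hospitalization risk score.
    A real audit would send a framed prompt to each model under test;
    here noise features simply nudge the score, mimicking the reported drift."""
    base = 100 * sum(v for k, v in profile.items()
                     if k.startswith("relevant_")) / 15
    drift = sum(v for k, v in profile.items() if k.startswith("noise_"))
    return min(100.0, base + drift + rng.gauss(0, 1 + 0.05 * drift))

def audit(n_profiles=50, noise_levels=(0, 10, 25, 50), seed=0):
    """For each noise level and prompt framing, report the mean and
    standard deviation of predicted risk over the synthetic cohort."""
    rng = random.Random(seed)
    results = {}
    for n_noise in noise_levels:
        for framing in FRAMINGS:
            scores = [score_risk(make_profile(rng, n_noise=n_noise),
                                 framing, rng)
                      for _ in range(n_profiles)]
            results[(n_noise, framing)] = (statistics.mean(scores),
                                           statistics.stdev(scores))
    return results
```

In a real audit the stub would be replaced by API calls to each of the four models, and the mean/variance shifts across noise levels and framings would be tested for statistical significance rather than merely tabulated.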