Reliability Auditing for Downstream LLM Tasks in Psychiatry: LLM-Generated Hospitalization Risk Scores
arXiv cs.AI / 4/27/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper examines whether LLMs used for psychiatric clinical reasoning produce reliable hospitalization risk scores, focusing on interpretive stability in an inherently uncertain domain.
- It introduces a reliability-auditing framework that tests two factors: how different prompt framings affect outputs, and how adding clinically insignificant inputs shifts the predicted risk.
- Using synthetic patient profiles (n = 50) with clinically relevant features plus up to 50 clinically insignificant features, evaluated across four prompt reframings, the study audits four LLMs: Gemini 2.5 Flash, Llama 3.3 70B, Claude Sonnet 4.6, and GPT-4o mini.
- Across all models and prompt styles, adding clinically insignificant variables significantly increases both the mean predicted hospitalization risk and the variability of the outputs, indicating reduced predictive stability.
- Prompt variation and the presence of clinically insignificant features each drive instability, both independently and in model-dependent ways, underscoring the need for systematic uncertainty and attribution-stability evaluations before clinical deployment (a minimal sketch of such an audit loop follows this list).
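The bullet points above translate naturally into a small audit harness. The sketch below is a minimal illustration of that design, not the paper's code: the prompt wordings in `PROMPT_FRAMINGS`, the feature names, and the `score_risk` stub are hypothetical stand-ins for the real prompts and LLM calls, but the loop structure (50 profiles per condition, four framings, increasing numbers of injected insignificant features, with per-condition mean and standard deviation) follows the described setup.

```python
# Minimal sketch of the audit design summarized above, NOT the paper's code.
# Hypothetical stand-ins (not from the source): the prompt wordings, the
# feature names, and score_risk(), which stubs out the actual LLM call.
import random
import statistics

# Four prompt reframings of the same scoring task (wording is illustrative).
PROMPT_FRAMINGS = [
    "You are a psychiatrist. Estimate hospitalization risk (0-100) for:\n{profile}",
    "As a triage clinician, rate this patient's hospitalization risk 0-100:\n{profile}",
    "Output a single number 0-100: likelihood of psychiatric admission.\n{profile}",
    "Clinical audit task. Give one hospitalization risk score, 0-100:\n{profile}",
]

CLINICALLY_RELEVANT = ["prior_admissions", "suicidal_ideation", "medication_adherence"]
# Up to 50 injectable features with no clinical meaning (e.g. favorite season).
CLINICALLY_INSIGNIFICANT = [f"irrelevant_feature_{i}" for i in range(50)]


def make_profile(rng: random.Random, n_noise: int) -> str:
    """Render one synthetic profile: all relevant features plus n_noise irrelevant ones."""
    lines = [f"{name}: {rng.choice(['yes', 'no'])}" for name in CLINICALLY_RELEVANT]
    lines += [f"{name}: {rng.choice(['A', 'B'])}"
              for name in rng.sample(CLINICALLY_INSIGNIFICANT, n_noise)]
    return "\n".join(lines)


def score_risk(model: str, prompt: str) -> float:
    """Placeholder for an LLM call that returns a 0-100 risk score.

    Swap in a real client and parse the numeric score from the completion;
    the deterministic stub below only keeps the sketch runnable.
    """
    return min(100.0, max(0.0, random.Random(model + prompt).gauss(35, 10)))


def audit(models, n_profiles=50, noise_levels=(0, 10, 50)):
    """Report mean and stdev of scores per model x framing x noise level.

    Rising mean and stdev as the noise level grows is the instability
    signal the paper reports.
    """
    for model in models:
        for fi, framing in enumerate(PROMPT_FRAMINGS):
            for n_noise in noise_levels:
                rng = random.Random(fi * 1000 + n_noise)  # reproducible profile sampling
                scores = [score_risk(model, framing.format(profile=make_profile(rng, n_noise)))
                          for _ in range(n_profiles)]
                print(f"{model:10s} framing={fi} noise={n_noise:2d} "
                      f"mean={statistics.mean(scores):5.1f} sd={statistics.stdev(scores):4.1f}")


if __name__ == "__main__":
    audit(["model-A", "model-B"])
```

In a real audit, `score_risk` would call each of the four models in turn, and the per-condition statistics would be compared across noise levels and framings to quantify the mean shift and variance inflation the study reports.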