Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems

arXiv cs.AI / 4/27/2026



Key Points

The paper argues that existing “LLM-as-a-judge” evaluations for multi-step self-harm/depression screening lack reliability estimates and cannot explain how errors compound across multiple LLM judgments, making them less suitable for safety-critical use.
It proposes a statistical framework for multi-agent LLM pipelines represented as DAGs, modeling each agent as a stochastic categorical decision and replacing heuristic voting with adaptive, principled decision-making.
The method adds agent-level performance confidence bounds, a bandit-based adaptive sampling strategy that adjusts based on input difficulty, and regret guarantees with logarithmic error growth in deployment.
Experiments on two behavioral-health datasets (AEGIS 2.0, N=161; SWMH Reddit stratified sample, N=250) show the adaptive sampling strategy achieving the lowest false positive rate of any condition: 0.095 versus 0.159 for single-agent models on AEGIS 2.0, about 40% fewer incorrect flags on safe content, with similar false negative rates across conditions.
Overall, the results indicate that adaptive sampling can meaningfully improve reliability/precision in behavioral health risk screening while maintaining recall in the evaluated setting.
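
The adaptive sampling idea in the key points can be illustrated with a minimal sketch. It is not the paper's bandit algorithm: it swaps in a simple sequential stopping rule built on a Hoeffding confidence interval. The names adaptive_decision, sample_agent, delta, and max_samples are hypothetical, and the agent is assumed to return a binary flag. The interval is checked after every sample without a correction for repeated looks, so the nominal confidence level is only approximate.

```python
import math
import random
from typing import Callable, NamedTuple


class Decision(NamedTuple):
    flag: bool       # majority label so far: True means "flag as at risk"
    n_samples: int   # number of agent calls spent on this input
    confident: bool  # True if the interval excluded 0.5 within the budget


def hoeffding_radius(n: int, delta: float) -> float:
    """Half-width of a two-sided Hoeffding interval for a mean in [0, 1]."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def adaptive_decision(
    sample_agent: Callable[[], bool],
    delta: float = 0.05,
    max_samples: int = 25,
) -> Decision:
    """Query a stochastic agent until its flag rate is clearly above or below 0.5.

    Easy inputs (agent almost always agrees with itself) stop after a few
    calls; hard inputs use the full budget and are returned as not confident.
    """
    flags = 0
    for n in range(1, max_samples + 1):
        flags += int(sample_agent())
        mean = flags / n
        radius = hoeffding_radius(n, delta)
        if mean - radius > 0.5 or mean + radius < 0.5:
            return Decision(mean > 0.5, n, True)
    # Budget exhausted: fall back to a plain majority vote, marked as uncertain.
    return Decision(flags / max_samples > 0.5, max_samples, False)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical agents: one flags 95% of the time, one is close to a coin flip.
    easy = adaptive_decision(lambda: rng.random() < 0.95)
    hard = adaptive_decision(lambda: rng.random() < 0.55)
    print("easy input:", easy)
    print("hard input:", hard)
```

Under these assumptions, an input on which the agent is consistent uses few calls, while an ambiguous input consumes the whole budget and can be routed to a more conservative fallback, which is the intuition behind spending samples according to input difficulty.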

Abstract

Emerging AI systems in behavioral health and psychiatry use multi-step or multi-agent LLM pipelines for tasks such as assessing self-harm risk and screening for depression. However, common evaluation approaches, such as LLM-as-a-judge, do not indicate when a decision is reliable or how errors may accumulate across multiple LLM judgments, limiting their suitability for safety-critical settings. We present a statistical framework for multi-agent pipelines structured as directed acyclic graphs (DAGs) that offers principled, adaptive decision-making as an alternative to heuristic voting. We model each agent as a stochastic categorical decision and introduce (1) tighter agent-level performance confidence bounds, (2) a bandit-based adaptive sampling strategy based on input difficulty, and (3) regret guarantees over the multi-agent system that show logarithmic error growth when deployed. We evaluate our system on two labeled behavioral health datasets: the AEGIS 2.0 behavioral health subset (N=161) and a stratified sample of SWMH Reddit posts (N=250). Empirically, our adaptive sampling strategy achieves the lowest false positive rate of any condition across both datasets: 0.095 on AEGIS 2.0 compared to 0.159 for single-agent models, which reduces incorrect flagging of safe content by 40% while keeping false negative rates similar across all conditions. These results suggest that principled adaptive sampling offers a meaningful improvement in precision without reducing recall in this setting.
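
The 40% figure and the precision/recall claim follow from standard confusion-matrix definitions, which the short sketch below works through. The two false positive rates (0.159 and 0.095) come from the abstract; the confusion-matrix counts are hypothetical values chosen only for illustration and are not taken from the paper.

```python
def false_positive_rate(fp: int, tn: int) -> float:
    """Share of truly safe items that were incorrectly flagged."""
    return fp / (fp + tn)


def precision(tp: int, fp: int) -> float:
    """Share of flagged items that were truly at risk."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Share of truly at-risk items that were flagged (1 - false negative rate)."""
    return tp / (tp + fn)


def relative_reduction(baseline: float, new: float) -> float:
    """Fractional drop from baseline to new."""
    return (baseline - new) / baseline


# Rates reported in the abstract for AEGIS 2.0.
fpr_single_agent = 0.159
fpr_adaptive = 0.095
reduction = relative_reduction(fpr_single_agent, fpr_adaptive)
print(f"relative reduction in false positive rate: {reduction:.1%}")

# Hypothetical counts (NOT from the paper): true positives and false negatives
# are held fixed, and only the number of false positives goes down.
tp, fn = 40, 10
before = {"fp": 16, "tn": 84}
after = {"fp": 10, "tn": 90}

for label, counts in (("before", before), ("after", after)):
    print(
        label,
        f"FPR={false_positive_rate(counts['fp'], counts['tn']):.2f}",
        f"precision={precision(tp, counts['fp']):.2f}",
        f"recall={recall(tp, fn):.2f}",
    )
```

With true positives and false negatives unchanged, fewer false positives raise precision while recall stays the same, which matches the pattern the abstract reports.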