PSI-Bench: Towards Clinically Grounded and Interpretable Evaluation of Depression Patient Simulators

arXiv cs.CL / 4/29/2026


Key Points

  • PSI-Bench is an automatic evaluation framework that provides interpretable, clinically grounded diagnostics of depression patient simulator behavior at the turn, dialogue, and population levels.
  • It addresses shortcomings of existing evaluations, which rely on LLM judges with poorly specified prompts and do not assess behavioral diversity.
  • Benchmarking seven LLMs with PSI-Bench shows that simulators produce overly long, lexically diverse responses while exhibiting reduced variability, resolving emotions too quickly, and following a uniform negative-to-positive emotional trajectory.
  • The simulation framework (the simulator implementation) has a larger impact on fidelity than model scale, and a human study shows the benchmark aligns strongly with expert judgments.
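The multi-level diagnostics described above can be illustrated with a minimal sketch. Note this is a hypothetical example, not the paper's actual metrics: it computes a simple turn-level length and type-token ratio, then aggregates variability across a population of simulated dialogues, the kind of "reduced variability" signal the benchmark reportedly surfaces.

```python
# Hypothetical sketch of turn- and population-level diagnostics.
# The real PSI-Bench metrics are not specified in this summary.
from statistics import mean, pstdev

def turn_metrics(response: str) -> dict:
    """Per-turn diagnostics: token count and lexical type-token ratio."""
    tokens = response.lower().split()
    ttr = len(set(tokens)) / len(tokens) if tokens else 0.0
    return {"length": len(tokens), "ttr": ttr}

def population_variability(dialogues: list[list[str]]) -> dict:
    """Population-level spread of turn lengths across simulated patients.
    A low standard deviation flags homogeneous simulator behavior."""
    lengths = [turn_metrics(t)["length"] for d in dialogues for t in d]
    return {"mean_length": mean(lengths), "length_std": pstdev(lengths)}

dialogues = [
    ["i feel tired all the time", "nothing really helps"],
    ["i feel tired all the time", "nothing really helps"],  # identical patient
]
print(population_variability(dialogues))
```

Identical simulated patients yield zero spread on any behavioral metric, which is the population-level failure mode the benchmark is designed to expose.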

Abstract

Patient simulators are gaining traction in mental health training by providing scalable exposure to complex and sensitive patient interactions. Simulating depressed patients is particularly challenging, as safety constraints and high patient variability complicate simulations and underscore the need for simulators that capture diverse and realistic patient behaviors. However, existing evaluations heavily rely on LLM-judges with poorly specified prompts and do not assess behavioral diversity. We introduce PSI-Bench, an automatic evaluation framework that provides interpretable, clinically grounded diagnostics of depression patient simulator behavior across turn-, dialogue-, and population-level dimensions. Using PSI-Bench, we benchmark seven LLMs across two simulator frameworks and find that simulators produce overly long, lexically diverse responses, show reduced variability, resolve emotions too quickly, and follow a uniform negative-to-positive trajectory. We also show that the simulation framework has a larger impact on fidelity than the model scale. Results from a human study demonstrate that our benchmark is strongly aligned with expert judgments. Our work reveals key limitations of current depression patient simulators and provides an interpretable, extensible benchmark to guide future simulator design and evaluation.