Advancing AI Trustworthiness Through Patient Simulation: Risk Assessment of Conversational Agents for Antidepressant Selection

arXiv cs.CL / 3/30/2026


Key Points

  • The paper proposes a patient simulator that generates realistic, controllable healthcare conversations to evaluate conversational agents at scale for risk assessment across populations.
  • The simulator is built around NIST AI Risk Management Framework concepts and combines medical profiles from All of Us EHR data, linguistic profiles tied to health literacy, and behavioral profiles (cooperative, distracted, adversarial).
  • In 500 simulations assessing an AI decision aid for antidepressant selection, performance degraded monotonically as health literacy decreased, with Rank-1 concept retrieval varying from 47.6% (limited) to 81.9% (proficient); a sketch of this metric follows the list.
  • Medical concept fidelity was high (96.6%) with substantial human and LLM-judge agreement (kappa values of 0.73 and 0.78), while behavioral profile classification was also reliable (0.93 kappa) and linguistic profile agreement was moderate (0.61 kappa).
  • The study concludes that health literacy is a primary, measurable risk factor for conversational healthcare AI, implying the need for more equitable deployment and evaluation practices.
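
To make the headline metric concrete, here is a minimal sketch of how Rank-1 concept retrieval could be scored per health-literacy group. The function name, the data layout, and the reading that "Rank-1 retrieval" means the decision aid's top-ranked concept matches the reference concept are illustrative assumptions, not the paper's actual evaluation harness.

```python
from collections import defaultdict

def rank1_retrieval_by_literacy(conversations):
    """Score Rank-1 concept retrieval per health-literacy group.

    `conversations` is assumed to be an iterable of dicts like
    {"literacy": "limited", "retrieved": ["C1", ...], "gold": "C1"},
    where `retrieved` is the decision aid's ranked concept list and
    `gold` is the reference concept for that simulated conversation.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for conv in conversations:
        group = conv["literacy"]
        totals[group] += 1
        # Rank-1 hit: the top-ranked retrieved concept equals the reference concept.
        if conv["retrieved"] and conv["retrieved"][0] == conv["gold"]:
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}

# Toy example with three simulated conversations across two literacy groups.
sims = [
    {"literacy": "limited",    "retrieved": ["C1", "C2"], "gold": "C2"},
    {"literacy": "limited",    "retrieved": ["C3"],       "gold": "C3"},
    {"literacy": "proficient", "retrieved": ["C4"],       "gold": "C4"},
]
print(rank1_retrieval_by_literacy(sims))  # {'limited': 0.5, 'proficient': 1.0}
```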

Abstract

Objective: This paper introduces a patient simulator for scalable, automated evaluation of healthcare conversational agents, generating realistic, controllable interactions that systematically vary across medical, linguistic, and behavioral dimensions to support risk assessment across populations.

Methods: Grounded in the NIST AI Risk Management Framework, the simulator integrates three profile components: (1) medical profiles constructed from All of Us electronic health records using risk-ratio gating; (2) linguistic profiles modeling health literacy and condition-specific communication; and (3) behavioral profiles representing cooperative, distracted, and adversarial engagement. Profiles were evaluated against NIST AI RMF trustworthiness requirements and assessed against an AI Decision Aid for antidepressant selection.

Results: Across 500 simulated conversations, the simulator revealed monotonic degradation in AI Decision Aid performance across health literacy levels: Rank-1 concept retrieval ranged from 47.6% (limited) to 81.9% (proficient), with corresponding recommendation degradation. Medical concept fidelity was high (96.6% across 8,210 concepts), validated by human annotators (0.73 kappa) and an LLM judge with comparable agreement (0.78 kappa). Behavioral profiles were reliably distinguished (0.93 kappa), and linguistic profiles showed moderate agreement (0.61 kappa).

Conclusions: The simulator exposes measurable performance risks in conversational healthcare AI. Health literacy emerged as a primary risk factor with direct implications for equitable AI deployment.
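
The abstract mentions "risk-ratio gating" over All of Us EHR data for building medical profiles but does not spell out the procedure. One plausible reading is that candidate co-occurring concepts are kept only when their prevalence in the target cohort, relative to a background cohort, exceeds a threshold. The sketch below implements that reading; the threshold, smoothing constant, and data layout are assumptions rather than the paper's specification.

```python
def risk_ratio_gate(target_counts, target_n, background_counts, background_n,
                    threshold=2.0, eps=1e-9):
    """Keep concepts whose cohort risk ratio exceeds `threshold`.

    risk ratio = P(concept | target cohort) / P(concept | background cohort)

    `*_counts` map concept codes to the number of patients carrying that
    concept; `*_n` are cohort sizes. Returns {concept: risk_ratio} for the
    concepts that pass the gate.
    """
    gated = {}
    for concept, count in target_counts.items():
        p_target = count / target_n
        p_background = background_counts.get(concept, 0) / background_n
        rr = p_target / (p_background + eps)  # eps guards against division by zero
        if rr >= threshold:
            gated[concept] = rr
    return gated

# Toy example: fatigue is about 3x as prevalent in the depression cohort as in
# the background, so it passes the gate; a non-specific concept like
# hypertension does not.
depression_cohort = {"fatigue": 300, "insomnia": 250, "hypertension": 200}
background_cohort = {"fatigue": 1000, "insomnia": 900, "hypertension": 2500}
print(risk_ratio_gate(depression_cohort, 1000, background_cohort, 10000))
```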