VeriSim: A Configurable Framework for Evaluating Medical AI Under Realistic Patient Noise

arXiv cs.AI · April 14, 2026


Key Points

  • The paper introduces VeriSim, a configurable, truth-preserving patient simulation framework that injects clinically grounded patient communication noise (e.g., recall gaps, low health literacy, anxiety) into medical LLM evaluations.
  • VeriSim maintains medical ground truth via a hybrid UMLS–LLM verification mechanism and implements six evidence-derived noise dimensions to better reflect real clinical interactions.
  • Experiments on seven open-weight medical LLMs show substantial performance degradation under realistic patient noise, including a 15–25% drop in diagnostic accuracy and a 34–55% increase in conversation length.
  • Smaller models (7B) degrade about 40% more than larger models (70B+), and standard medical fine-tuning on conventional corpora provides limited robustness against communication noise.
  • The framework is validated by board-certified clinicians with strong inter-annotator agreement (kappa > 0.80), and LLM-as-a-Judge is shown to be a reliable, scalable auxiliary evaluator; VeriSim is released as open source.
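To make the mechanism in the points above concrete, here is a minimal sketch of what a truth-preserving noise-injection loop could look like: perturb a patient response along one noise dimension, then accept the perturbation only if a verifier confirms the medical ground truth survives. All names, dimension labels, and the toy perturbation/verification logic are illustrative assumptions, not the paper's actual API; in VeriSim the rewriting would be LLM-driven and verification uses a hybrid UMLS-LLM mechanism.

```python
# Hypothetical sketch of a truth-preserving noise-injection loop.
# Names and logic are illustrative, NOT VeriSim's real implementation.

# Noise dimensions loosely paraphrasing the paper's evidence-derived set
NOISE_DIMENSIONS = [
    "recall_gap",            # patient misremembers onset/timing
    "low_health_literacy",   # lay terms replace clinical vocabulary
    "anxiety",               # hedging, repetition, tangents
    "stigma_nondisclosure",  # sensitive facts withheld until prompted
    "vague_quantifiers",     # "a while back" instead of "3 weeks ago"
    "topic_drift",           # unrelated details mixed in
]

def inject_noise(response: str, dimension: str, intensity: float) -> str:
    """Toy perturbation; a real system would rewrite the response with an
    LLM conditioned on the chosen noise dimension and intensity."""
    if dimension == "low_health_literacy":
        return response.replace("myocardial infarction", "heart attack")
    if dimension == "vague_quantifiers" and intensity > 0.5:
        return response.replace("3 weeks ago", "a while back")
    return response

def facts_preserved(noisy: str, ground_truth_facts: list) -> bool:
    """Stand-in for the hybrid UMLS-LLM verifier: each fact is a string of
    '|'-separated acceptable surface forms, at least one of which must
    still appear in the noisy response."""
    return all(any(form in noisy for form in fact.split("|"))
               for fact in ground_truth_facts)

def noisy_patient_turn(response, facts, dimension, intensity, max_retries=3):
    """Retry noise injection until the ground truth is verified preserved;
    fall back to the clean response if verification keeps failing."""
    for _ in range(max_retries):
        candidate = inject_noise(response, dimension, intensity)
        if facts_preserved(candidate, facts):
            return candidate
    return response
```

For example, rewriting "I had a myocardial infarction 3 weeks ago." under the `low_health_literacy` dimension yields "I had a heart attack 3 weeks ago.", which still satisfies a verifier that accepts either surface form of the diagnosis, so the noisy turn is kept.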

Abstract

Medical large language models (LLMs) achieve impressive performance on standardized benchmarks, yet these evaluations fail to capture the complexity of real clinical encounters where patients exhibit memory gaps, limited health literacy, anxiety, and other communication barriers. We introduce VeriSim, a truth-preserving patient simulation framework that injects controllable, clinically evidence-grounded noise into patient responses while maintaining strict adherence to medical ground truth through a hybrid UMLS-LLM verification mechanism. Our framework operationalizes six noise dimensions derived from peer-reviewed medical communication literature, capturing authentic clinical phenomena such as patient recall limitations, health literacy barriers, and stigma-driven non-disclosure. Experiments across seven open-weight LLMs reveal that all models degrade significantly under realistic patient noise, with diagnostic accuracy dropping 15-25% and conversation length increasing 34-55%. Notably, smaller models (7B) show 40% greater degradation than larger models (70B+), while medical fine-tuning on standard corpora provides limited robustness benefits against patient communication noise. Evaluation by board-certified clinicians demonstrates high-quality simulation with strong inter-annotator agreement (kappa > 0.80), while LLM-as-a-Judge serves as a validated auxiliary evaluator achieving comparable reliability for scalable assessment. Our results highlight a critical Sim-to-Real gap in current medical AI. We release VeriSim as an open-source noise-injection framework, establishing a rigorous testbed for evaluating clinical robustness.