VeriSim: A Configurable Framework for Evaluating Medical AI Under Realistic Patient Noise
arXiv cs.AI / 4/14/2026
Key Points
- The paper introduces VeriSim, a configurable, truth-preserving patient simulation framework that injects clinically grounded patient communication noise (e.g., recall gaps, low health literacy, anxiety) into medical LLM evaluations.
- VeriSim maintains medical ground truth via a hybrid UMLS–LLM verification mechanism and implements six evidence-derived noise dimensions to better reflect real clinical interactions.
- Experiments on seven open-weight medical LLMs show substantial performance degradation under realistic patient noise, including a 15–25% drop in diagnostic accuracy and a 34–55% increase in conversation length.
- The study finds that smaller models (7B) degrade roughly 40% more than larger models (70B+), and that standard medical fine-tuning on conventional corpora provides limited robustness against communication noise.
- The framework is evaluated by board-certified clinicians with strong annotation agreement (kappa > 0.80), and LLM-as-a-judge is validated as a scalable auxiliary evaluator; VeriSim is released as open source.
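The key points above describe a truth-preserving noise injector: surface wording is perturbed along clinically grounded dimensions while the underlying medical facts travel unchanged for later verification. A minimal sketch of that idea follows; the dimension names echo the paper's examples, but every function, transform, and parameter here is an illustrative assumption, not the released VeriSim API.

```python
import random

# Hypothetical surface-level transforms for three of the noise
# dimensions named in the summary (recall gaps, low health
# literacy, anxiety). Real VeriSim transforms are LLM-driven and
# far richer; these string edits only illustrate the interface.
NOISE_TRANSFORMS = {
    "recall_gap": lambda t: t.replace("three days", "a few days, maybe"),
    "low_health_literacy": lambda t: t.replace(
        "chest pain", "a bad feeling in my chest"),
    "anxiety": lambda t: "I'm really worried, but " + t[0].lower() + t[1:],
}

def inject_noise(utterance, ground_truth, dimensions, seed=0):
    """Perturb the patient's surface utterance along the selected
    noise dimensions while returning the untouched ground truth,
    so a downstream verifier (e.g. the paper's hybrid UMLS-LLM
    check) can confirm no clinical fact was altered."""
    rng = random.Random(seed)
    noisy = utterance
    for dim in dimensions:
        # Assumed per-dimension application rate of 0.9.
        if dim in NOISE_TRANSFORMS and rng.random() < 0.9:
            noisy = NOISE_TRANSFORMS[dim](noisy)
    return noisy, ground_truth  # truth rides alongside the noisy text

clean = "I have had chest pain for three days."
truth = {"symptom": "chest pain", "duration_days": 3}
noisy, truth_out = inject_noise(
    clean, truth, ["recall_gap", "low_health_literacy", "anxiety"])
```

The design point this sketch captures is the separation of surface form from ground truth: the evaluated model only ever sees `noisy`, while `truth_out` stays byte-identical to the original record, which is what makes degradation attributable to communication noise rather than to corrupted facts.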