MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors
arXiv cs.CL / 4/9/2026
Key Points
- MedDialBench is introduced as a benchmark for measuring LLM diagnostic robustness under parametric, non-cooperative patient behaviors with graded severity levels and case-specific scripts.
- The benchmark decomposes patient non-cooperation into five behavior dimensions (Logic Consistency, Health Cognition, Expression Style, Disclosure, and Attitude), enabling dose-response and factorial cross-dimension interaction analysis; a configuration sketch follows this list.
- Across evaluations of five frontier LLMs over 7,225 dialogues, the study finds a strong asymmetry: “information pollution” (fabricating symptoms) causes much larger accuracy degradation than “information deficit” (withholding information).
- Fabricating symptoms is the only adversarial configuration that shows statistically significant accuracy drops across all five models, and dimension pairs that include it produce super-additive failures (see the interaction check sketched after this list).
- Models show distinct vulnerability profiles, with worst-case accuracy drops of roughly 38.8–54.1 percentage points; exhaustive questioning can mitigate information deficit but cannot recover from fabricated inputs.
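
As a rough illustration of what a parametric, severity-graded behavior configuration might look like, here is a minimal Python sketch. The five dimension names come from the benchmark's description above, but the severity scale, field names, and prompt-directive rendering are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical sketch of a parametric patient-behavior configuration.
# Dimension names follow the benchmark's five axes; the 0-3 severity scale
# (0 = cooperative, 3 = strongly adversarial) is an assumption for illustration.
@dataclass
class PatientBehaviorConfig:
    logic_consistency: int = 0   # higher = more self-contradictory answers
    health_cognition: int = 0    # higher = more misconceptions about own health
    expression_style: int = 0    # higher = vaguer, more rambling descriptions
    disclosure: int = 0          # higher = more information withheld (deficit)
    attitude: int = 0            # higher = more hostile or dismissive replies

    def as_prompt_directives(self) -> list[str]:
        """Render the configuration as simple patient-simulator directives."""
        levels = ["none", "mild", "moderate", "severe"]
        fields = {
            "logic consistency violations": self.logic_consistency,
            "health misconceptions": self.health_cognition,
            "unclear expression style": self.expression_style,
            "withholding of information": self.disclosure,
            "uncooperative attitude": self.attitude,
        }
        return [f"{name}: {levels[v]}" for name, v in fields.items() if v > 0]

# Example: a patient who moderately withholds information and is mildly hostile.
cfg = PatientBehaviorConfig(disclosure=2, attitude=1)
print(cfg.as_prompt_directives())
```

Graded severities per dimension are what make dose-response curves and factorial (dimension-by-dimension) comparisons possible: each dialogue is run under one configuration, and accuracy is aggregated per setting.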
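
The super-additive failure claim can be checked by comparing the accuracy drop of a combined configuration against the sum of the single-dimension drops. A minimal sketch of that interaction check follows; the function names and the numbers in the example are placeholders, not results from the paper.

```python
def accuracy_drop(baseline_acc: float, perturbed_acc: float) -> float:
    """Accuracy lost relative to the cooperative-patient baseline, in percentage points."""
    return (baseline_acc - perturbed_acc) * 100

def is_super_additive(baseline: float, acc_a: float, acc_b: float, acc_ab: float) -> bool:
    """True if the combined two-dimension drop exceeds the sum of the individual drops."""
    drop_a = accuracy_drop(baseline, acc_a)
    drop_b = accuracy_drop(baseline, acc_b)
    drop_ab = accuracy_drop(baseline, acc_ab)
    return drop_ab > drop_a + drop_b

# Placeholder numbers for illustration only: 30 pp combined drop vs. 10 pp + 8 pp individually.
print(is_super_additive(baseline=0.80, acc_a=0.70, acc_b=0.72, acc_ab=0.50))  # True
```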