A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations
arXiv cs.CL / 3/27/2026
Key Points
- The study introduces CPGBench, an automated benchmark framework to evaluate how well LLMs detect and adhere to clinical practice guidelines (CPGs) in multi-turn conversations.
- Drawing on 3,418 CPG documents spanning 24 specialties from 9 regions and 2 organizations, the authors extract 32,155 recommendations and generate one multi-turn conversation per recommendation, which they use to test 8 leading LLMs.
- Results reveal a gap between detection and traceability: models correctly detect 71.1%–89.6% of recommendations, but correctly reference the source guideline's title only 3.6%–29.7% of the time.
- Adherence is substantially weaker, with rates ranging from 21.8% to 63.2% depending on the model, indicating that LLMs struggle to translate guideline knowledge into correct application (a minimal scoring sketch follows this list).
- The benchmark is validated through human evaluation by 56 clinicians, and the authors describe it as the first systematic benchmark to reveal where LLMs fail at CPG detection and adherence in conversational clinical settings.
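
The digest reports three per-model rates: detection, correct title referencing, and adherence. CPGBench's actual evaluation pipeline is not published in this summary, so the sketch below only illustrates how such rates might be aggregated from per-item judgments; every name in it (`Record`, `summarize`, and the boolean fields) is a hypothetical assumption, not the authors' code.

```python
# Hypothetical aggregation of CPGBench-style metrics from per-conversation
# judgments. Field names and structure are illustrative assumptions; the
# paper's real scoring pipeline is not shown in this digest.
from dataclasses import dataclass

@dataclass
class Record:
    detected: bool          # model identified the relevant recommendation
    title_referenced: bool  # model cited the correct source-guideline title
    adherent: bool          # model's advice actually followed the recommendation

def summarize(records: list[Record]) -> dict[str, float]:
    """Aggregate per-item boolean judgments into the three reported rates.

    Assumes a non-empty list of judged conversations.
    """
    n = len(records)
    return {
        "detection_rate": sum(r.detected for r in records) / n,
        "title_reference_rate": sum(r.title_referenced for r in records) / n,
        "adherence_rate": sum(r.adherent for r in records) / n,
    }

# Example: a model that detects both recommendations but cites and follows
# only one, mirroring the detection-vs-traceability gap the paper reports.
print(summarize([Record(True, False, False), Record(True, True, True)]))
# -> {'detection_rate': 1.0, 'title_reference_rate': 0.5, 'adherence_rate': 0.5}
```

Under this framing, a model can score high on detection while scoring low on title referencing and adherence, which is exactly the pattern the benchmark reports across the 8 evaluated LLMs.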