HealthBench Professional: Evaluating Large Language Models on Real Clinician Chats
arXiv cs.CL / 5/1/2026
Key Points
- HealthBench Professional is introduced as an open benchmark to evaluate large language models specifically on real-world clinician chat tasks, based on how clinicians actually use ChatGPT during their work.
- The benchmark is structured around three core clinical use cases (care consult, writing/documentation, and medical research), with physician-authored examples of clinician–ChatGPT conversations.
- Scoring uses physician-written rubrics; each rubric is adjudicated by at least three physician reviewers across multiple review phases to improve reliability.
- The dataset deliberately emphasizes high-quality, representative cases that are difficult for frontier models, including an enriched set of hard examples and a substantial share of deliberately adversarial prompts.
- In evaluated results, the top system (GPT-5.4 in ChatGPT for Clinicians) outperforms other evaluated models and human physicians, providing a baseline for tracking progress in clinically relevant performance.
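The digest does not reproduce the paper's exact aggregation formula, but rubric-based grading with multiple reviewers can be sketched as follows. All specifics here are illustrative assumptions: the point values, the majority-vote rule for deciding whether a criterion is "met", and the normalization by the maximum attainable positive points.

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    criterion: str
    points: int  # positive for desired behaviors, negative for errors (assumed convention)

def grade_response(items: list[RubricItem], met_votes: list[int], n_reviewers: int = 3) -> float:
    """Score one model response against a physician-written rubric.

    `met_votes[i]` is how many of the `n_reviewers` reviewers judged
    criterion i as satisfied; a simple majority counts as "met".
    The score is the sum of points for met criteria, normalized by the
    maximum attainable (positive) points and clipped to [0, 1].
    """
    earned = sum(
        item.points
        for item, votes in zip(items, met_votes)
        if votes * 2 > n_reviewers  # strict majority of reviewers
    )
    max_points = sum(item.points for item in items if item.points > 0)
    return max(0.0, min(1.0, earned / max_points)) if max_points else 0.0

# Hypothetical rubric for a care-consult response:
rubric = [
    RubricItem("Cites relevant guideline for dosing", 5),
    RubricItem("Flags the drug-drug interaction", 3),
    RubricItem("Gives a confidently wrong diagnosis", -6),
]
# Reviewers agree the first two criteria are met and the error criterion
# is not, so the score is (5 + 3) / 8 = 1.0.
print(grade_response(rubric, met_votes=[2, 3, 1]))  # → 1.0
```

Under this scheme a response that also trips the negative criterion would score (5 + 3 - 6) / 8 = 0.25, so penalty items pull scores down without letting them go below zero.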