HealthBench Professional: Evaluating Large Language Models on Real Clinician Chats

arXiv cs.CL / 5/1/2026

Key Points

  • HealthBench Professional is introduced as an open benchmark to evaluate large language models specifically on real-world clinician chat tasks, based on how clinicians actually use ChatGPT during their work.
  • The benchmark is structured around three core clinical use cases—care consult, writing/documentation, and medical research—with physician-authored clinician–ChatGPT conversation examples.
  • Scoring uses physician-written rubrics that are iteratively adjudicated by three or more physician reviewers across three phases, aiming for reliable assessment (see the scoring sketch after this list).
  • The dataset intentionally emphasizes high-quality, representative examples that are difficult for frontier models: hard examples are enriched roughly 3.5x relative to the candidate pool, and about one-third involve deliberate adversarial testing by physicians.
  • In the reported results, the top system, GPT-5.4 in ChatGPT for Clinicians, outperforms base GPT-5.4, all other evaluated models, and specialist-matched human physicians, giving the community a reference point for tracking clinically relevant progress.
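
To make the rubric scoring concrete, below is a minimal Python sketch in the style of the original HealthBench: each physician-written criterion carries points, a grader marks whether the response satisfies it, and the score is achieved points over the maximum attainable, clipped to [0, 1]. Whether HealthBench Professional uses this exact formula is an assumption, and all names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    """One physician-written rubric criterion (hypothetical structure)."""
    description: str
    points: int   # positive for desired behavior, negative for harmful errors
    met: bool     # grader judgment after multi-physician adjudication

def rubric_score(criteria: list[Criterion]) -> float:
    """Achieved points over maximum attainable positive points, clipped to [0, 1]."""
    achieved = sum(c.points for c in criteria if c.met)
    max_points = sum(c.points for c in criteria if c.points > 0)
    return 0.0 if max_points == 0 else max(0.0, min(1.0, achieved / max_points))

# Example: a care-consult response graded against three criteria.
example = [
    Criterion("Lists the key differential diagnoses", points=5, met=True),
    Criterion("Recommends an appropriate first-line workup", points=3, met=False),
    Criterion("States a fabricated drug interaction", points=-4, met=False),
]
print(rubric_score(example))  # 5 / 8 = 0.625
```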

Abstract

Millions of clinicians use ChatGPT to support clinical care, but evaluations of the most common use cases in model-clinician conversations are limited. We introduce HealthBench Professional, an open benchmark for evaluating large language models on real tasks that clinicians bring to ChatGPT in the course of their work. The benchmark is organized around three common use cases central to clinical practice: care consult, writing and documentation, and medical research. Each example includes a physician-authored conversation with ChatGPT for Clinicians and is scored via rubrics written and iteratively adjudicated by three or more physicians across three phases. HealthBench Professional examples were carefully selected for quality, representativeness, and difficulty for OpenAI's current frontier models, to enable continued measurement of progress. Difficult examples for recent OpenAI models were enriched by roughly 3.5 times relative to the candidate pool of 15,079 examples. Additionally, about one-third of examples involve physicians conducting deliberate adversarial testing of models. As a strong baseline, we also collected human physician responses for all tasks (unbounded time, specialist-matched, web access). The best scoring system, GPT-5.4 in ChatGPT for Clinicians, outperforms base GPT-5.4, all other models, and human physicians. We hope HealthBench Professional provides the healthcare AI community a measure to track frontier model progress in real-world clinical tasks and build systems that clinicians can trust to improve care.
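
The roughly 3.5x difficulty enrichment amounts to over-representing, in the final benchmark, examples that recent frontier models handle poorly. A hypothetical stratified-sampling sketch follows; the paper's actual selection also screened for quality and representativeness, and `is_hard` stands in for whatever model-failure signal was used.

```python
import random

def enrich_hard_examples(pool, is_hard, factor=3.5, k=500, seed=0):
    """Sample k examples so hard ones appear at `factor` times their
    base rate in the candidate pool (hypothetical illustration)."""
    rng = random.Random(seed)
    hard = [ex for ex in pool if is_hard(ex)]
    easy = [ex for ex in pool if not is_hard(ex)]
    base_rate = len(hard) / len(pool)
    n_hard = min(len(hard), round(min(1.0, factor * base_rate) * k))
    n_easy = min(len(easy), k - n_hard)
    return rng.sample(hard, n_hard) + rng.sample(easy, n_easy)

# E.g. if 10% of a 15,079-example candidate pool is hard, a 500-example
# benchmark would contain about 35% hard examples (3.5x the base rate).
```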