PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments

arXiv cs.AI / 5/5/2026


Key Points

  • The paper introduces PhysicianBench, a benchmark designed to evaluate LLM agents performing physician tasks inside realistic electronic health record (EHR) environments rather than relying on static knowledge recall.
  • It focuses on long-horizon, multi-step clinical workflows by adapting 100 real consultation cases across 21 specialties, with tasks requiring an average of 27 tool calls and spanning diagnosis interpretation, medication prescribing, and treatment planning.
  • Each benchmark task is instantiated with real patient records and accessed via standard EHR vendor-style APIs; completion is verified against 670 structured checkpoints graded by execution-grounded scripts (see the sketch after this list).
  • Testing 13 proprietary and open-source LLM agents shows a large capability gap: the best agent reaches only 46% pass@1 success, while open-source models top out at 19%.
  • PhysicianBench aims to provide a more realistic measurement of progress toward autonomous clinical agents by enforcing verifiable execution against the EHR environment.
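
As a rough illustration of the checkpoint-based grading described above, here is a minimal Python sketch. All names here (Checkpoint, grade_task, the verifier callables) are hypothetical, and the all-checkpoints-pass success criterion is an assumption; the paper's task-specific scripts may score stages differently.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Checkpoint:
    """One verifiable stage of a task (hypothetical structure)."""
    description: str
    # Execution-grounded check: inspects the live EHR state, not the agent's text.
    verify: Callable[[dict], bool]

def grade_task(ehr_state: dict, checkpoints: list[Checkpoint]) -> tuple[int, bool]:
    """Return (# checkpoints passed, task solved?).

    Assumption: a task counts as solved only if every checkpoint passes;
    PhysicianBench's actual grading scripts may differ.
    """
    passed = sum(cp.verify(ehr_state) for cp in checkpoints)
    return passed, passed == len(checkpoints)

def pass_at_1(results: list[bool]) -> float:
    """pass@1 over single-attempt runs: fraction of tasks solved on the first try."""
    return sum(results) / len(results) if results else 0.0
```

Under this reading, the 46% pass@1 figure would mean the best agent solved roughly 46 of the benchmark's 100 tasks on its first attempt.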

Abstract

We introduce PhysicianBench, a benchmark for evaluating LLM agents on physician tasks grounded in real clinical settings within electronic health record (EHR) environments. Existing medical agent benchmarks primarily focus on static knowledge recall, single-step atomic actions, or action intent without verifiable execution against the environment. As a result, they fail to capture the long-horizon, composite workflows that characterize real clinical systems. PhysicianBench comprises 100 long-horizon tasks adapted from real consultation cases between primary care and subspecialty physicians, with each task independently reviewed by a separate panel of physicians. Tasks are instantiated in an EHR environment with real patient records and accessed through the same standard APIs used by commercial EHR vendors. Tasks span 21 specialties (e.g., cardiology, endocrinology, oncology, psychiatry) and diverse workflow types (e.g., diagnosis interpretation, medication prescribing, treatment planning), requiring an average of 27 tool calls per task. Solving each task requires retrieving data across encounters, reasoning over heterogeneous clinical information, executing consequential clinical actions, and producing clinical documentation. Each task is decomposed into structured checkpoints (670 in total across the benchmark) capturing distinct stages of completion, graded by task-specific scripts with execution-grounded verification. Across 13 proprietary and open-source LLM agents, the best-performing model achieves only a 46% success rate (pass@1), while open-source models reach at most 19%, revealing a substantial gap between current agent capabilities and the demands of real-world clinical workflows. PhysicianBench provides a realistic and execution-grounded benchmark for measuring progress toward autonomous clinical agents.
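
To make "execution-grounded verification" concrete: rather than grading the agent's stated intent, a checkpoint script queries the EHR itself to confirm the action actually landed. The sketch below assumes a FHIR-style REST API (the common interoperability standard among commercial EHR vendors); the base URL, endpoint, and search parameters are illustrative assumptions, not taken from the paper.

```python
import requests

BASE_URL = "https://ehr.example.org/fhir"  # hypothetical FHIR server

def medication_order_exists(patient_id: str, rxnorm_code: str) -> bool:
    """Execution-grounded check: was a medication order for this drug
    actually written to the patient's record?

    Assumes a FHIR R4-style MedicationRequest search; the benchmark's
    real APIs and parameters may differ.
    """
    resp = requests.get(
        f"{BASE_URL}/MedicationRequest",
        params={"patient": patient_id, "code": rxnorm_code, "status": "active"},
        timeout=10,
    )
    resp.raise_for_status()
    bundle = resp.json()
    # A FHIR search returns a Bundle; any matching entry means the order exists.
    return bundle.get("total", 0) > 0
```

A check like this passes only if the prescribing action reached the EHR, which is what distinguishes execution-grounded grading from scoring a model's free-text answer.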