HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks

arXiv cs.AI / 4/14/2026

Key Points

  • The paper introduces HealthAdminBench, a new benchmark designed to evaluate LLM-based computer-use agents on end-to-end healthcare administration GUI workflows.
  • The benchmark covers four realistic interfaces (an EHR, two payer portals, and a fax system) and 135 fine-grained tasks across Prior Authorization, Appeals/Denials Management, and DME Order Processing.
  • Results across seven agent configurations show a persistent reliability gap: even when subtask performance is high, end-to-end task success is low.
  • The best end-to-end performer reported is Claude Opus 4.6 CUA with 36.3% task success, while GPT-5.4 CUA achieves the highest subtask success rate at 82.8%.
  • HealthAdminBench aims to provide a more rigorous evaluation foundation for progress toward safe and reliable automation of healthcare administrative operations.
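One way to see why a high subtask success rate can coexist with a low end-to-end rate is to look at how per-step reliability compounds. The sketch below is an illustration only: it assumes a task succeeds only when every subtask passes and that subtasks pass independently, neither of which the summary states. The function name and the example task lengths are hypothetical.

```python
def all_pass_probability(p_subtask: float, n_subtasks: int) -> float:
    """Probability that n_subtasks independent checks all pass.

    Assumption (not from the paper): subtasks are independent and a task
    counts as successful only if every one of its subtasks passes.
    """
    if not 0.0 <= p_subtask <= 1.0:
        raise ValueError("p_subtask must be in [0, 1]")
    if n_subtasks < 0:
        raise ValueError("n_subtasks must be non-negative")
    return p_subtask ** n_subtasks


if __name__ == "__main__":
    # 0.828 is the highest reported subtask success rate; the task
    # lengths below are hypothetical values chosen for illustration.
    for n in (1, 5, 10, 15):
        print(f"{n:>2} subtasks -> {all_pass_probability(0.828, n):.3f}")
```

Under these assumptions the all-pass probability falls quickly as tasks get longer, which is consistent in spirit with the reliability gap the key points describe. Real agents' errors are likely correlated, so this is not a prediction of the reported numbers.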

Abstract

Healthcare administration accounts for over $1 trillion in annual spending, making it a promising target for LLM-based computer-use agents (CUAs). While clinical applications of LLMs have received significant attention, no benchmark exists for evaluating CUAs on end-to-end administrative workflows. To address this gap, we introduce HealthAdminBench, a benchmark comprising four realistic GUI environments (an EHR, two payer portals, and a fax system) and 135 expert-defined tasks spanning three administrative task types: Prior Authorization, Appeals and Denials Management, and Durable Medical Equipment (DME) Order Processing. Each task is decomposed into fine-grained, verifiable subtasks, yielding 1,698 evaluation points. We evaluate seven agent configurations under multiple prompting and observation settings and find that, despite strong subtask performance, end-to-end reliability remains low: the best-performing agent (Claude Opus 4.6 CUA) achieves only 36.3% task success, while GPT-5.4 CUA attains the highest subtask success rate (82.8%). These results reveal a substantial gap between current agent capabilities and the demands of real-world administrative workflows. HealthAdminBench provides a rigorous foundation for evaluating progress toward safe and reliable automation of healthcare administrative workflows.
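The two metrics in the abstract can be sketched as follows. This is a minimal illustration, not the benchmark's actual scoring code: it assumes each task's result is a list of pass/fail subtask checks and that a task succeeds only if all of its checks pass. The `score` function, the task identifiers, and the toy data are hypothetical.

```python
from typing import Dict, List


def score(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Compute subtask-level and task-level success rates.

    `results` maps a task id to its list of subtask pass/fail outcomes.
    Assumption (not from the paper): a task is successful only when
    every one of its subtasks passes.
    """
    if not results or any(len(checks) == 0 for checks in results.values()):
        raise ValueError("every task needs at least one subtask result")

    total_subtasks = sum(len(checks) for checks in results.values())
    passed_subtasks = sum(sum(checks) for checks in results.values())
    passed_tasks = sum(all(checks) for checks in results.values())

    return {
        "subtask_success": passed_subtasks / total_subtasks,
        "task_success": passed_tasks / len(results),
    }


# Toy data with made-up task ids; one failed check sinks a whole task.
toy_results = {
    "prior_auth_01": [True, True, True, True],
    "appeal_01": [True, True, False, True],
    "dme_order_01": [True, False, True, True],
}
metrics = score(toy_results)
print(metrics)
```

In this toy data 10 of 12 subtask checks pass, yet only 1 of 3 tasks succeeds end to end. For scale, the benchmark's 1,698 evaluation points over 135 tasks work out to roughly 12.6 checks per task on average, so there are many places for a single failure to occur.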