HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks
arXiv cs.AI / 4/14/2026
Key Points
- The paper introduces HealthAdminBench, a new benchmark designed to evaluate LLM-based computer-use agents on end-to-end healthcare administration GUI workflows.
- The benchmark covers four realistic interfaces (an EHR, two payer portals, and a fax system) and 135 fine-grained tasks across Prior Authorization, Appeals/Denials Management, and DME Order Processing.
- Results across seven agent configurations show a persistent reliability gap: even when subtask performance is high, end-to-end task success is low.
- Claude Opus 4.6 CUA is the best end-to-end performer reported, at 36.3% task success, while GPT-5.4 CUA achieves the highest subtask success rate, at 82.8%.
- HealthAdminBench aims to provide a more rigorous evaluation foundation for progress toward safe and reliable automation of healthcare administrative operations.
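
One way to see why high subtask success can coexist with low end-to-end success is to look at how per-step rates compound. The sketch below is illustrative only and is not the paper's analysis: it assumes a task succeeds only if every subtask succeeds, that subtasks succeed independently, and that each has the same success rate. The function name `end_to_end_success` and the step counts are hypothetical, and the 0.828 input is borrowed from the reported best subtask rate purely as an example value.

```python
def end_to_end_success(subtask_rate: float, num_subtasks: int) -> float:
    """Probability that all subtasks succeed.

    Assumption (not from the paper): subtasks are independent and
    share the same success rate, so the probabilities multiply.
    """
    if not 0.0 <= subtask_rate <= 1.0:
        raise ValueError("subtask_rate must be between 0 and 1")
    if num_subtasks < 1:
        raise ValueError("num_subtasks must be at least 1")
    return subtask_rate ** num_subtasks


if __name__ == "__main__":
    # Hypothetical workflow lengths; the benchmark's real task lengths
    # are not stated in this summary.
    for steps in (1, 3, 5, 10):
        rate = end_to_end_success(0.828, steps)
        print(f"{steps:>2} subtasks -> {rate:.1%} end-to-end")
```

Under these simplifying assumptions, a workflow of only a few steps already drops well below the per-step rate, which is consistent in direction with the gap the key points describe. Real subtasks are unlikely to be independent or equally difficult, so this is a rough intuition rather than a model of the benchmark.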