End-to-End Evaluation and Governance of an EHR-Embedded AI Agent for Clinicians

arXiv cs.AI / 5/1/2026



Key Points

The paper argues that clinical AI needs continuous governance beyond point-in-time evaluation, including ongoing monitoring, re-evaluation, and iterative improvement during deployment.
It proposes an end-to-end governance framework combining rubric validation, live deployment feedback, technical performance monitoring, cost tracking, and gated experimentation for system changes.
The framework was applied to Hyperscribe, an EHR-embedded agent that turns ambient audio into structured chart updates; 20 clinicians authored 1,646 validated rubrics across 823 cases.
Across seven Hyperscribe versions evaluated in controlled experiments, median scores rose from 84% to 95%. Over three months, live feedback shifted from 79% error reports and 14% positive observations to 30% error reports and 45% positive observations as engineering fixes resolved failures.
Median processing time was 8.1 seconds per audio segment, and the effective completion rate reached 99.6% after retry mechanisms absorbed transient model errors.
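
The gating idea in the points above can be illustrated with a small sketch. This is not the paper's implementation: the function name `passes_gate`, the `min_gain` threshold, the promotion rule itself, and the sample scores are all hypothetical and serve only to show how a change might be held back unless its median rubric score matches or beats the current version.

```python
from statistics import median


def passes_gate(baseline_scores, candidate_scores, min_gain=0.0):
    """Hypothetical promotion rule: the candidate's median rubric score
    must be at least the baseline median plus min_gain."""
    if not baseline_scores or not candidate_scores:
        raise ValueError("both score lists must be non-empty")
    return median(candidate_scores) >= median(baseline_scores) + min_gain


# Made-up per-case scores (fraction of rubric points earned).
baseline = [0.80, 0.84, 0.88]
candidate = [0.93, 0.95, 0.97]

if passes_gate(baseline, candidate):
    print("candidate may be deployed")
else:
    print("candidate is held back")
```

A real gate would likely also consider score spread, per-rubric regressions, and cost, none of which this sketch models.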

Abstract

Clinical AI systems require not just point-in-time evaluation but continuous governance: the ongoing practice of monitoring, evaluating, iterating, and re-evaluating performance throughout deployment. We present an end-to-end framework of governance that integrates rubric validation, live deployment feedback, technical performance monitoring, and cost tracking, with controlled experimentation gating system changes before deployment. Applied to Hyperscribe, an EHR-embedded agent that converts ambient audio into structured chart updates, twenty clinicians authored 1,646 validated rubrics across 823 cases. Seven Hyperscribe versions were evaluated through controlled experiments, with median scores improving from 84% to 95%. Analysis of 107 live feedback entries over three months showed feedback composition shifting from 79% error reports and 14% positive observations to 30% errors and 45% positive observations as engineering interventions resolved failures. Median processing time per audio segment was 8.1 seconds with a 99.6% effective completion rate after retry mechanisms absorbed transient model errors. These results demonstrate that continuous, multi-channel governance of deployed clinical AI is both achievable and effective.
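
The abstract's "effective completion rate after retry" can be read as the share of audio segments that eventually succeed once transient failures are retried. The sketch below shows that reading under stated assumptions: `TransientModelError`, `process_with_retry`, `effective_completion_rate`, the attempt limit, and the backoff schedule are hypothetical names and choices, not details taken from the paper.

```python
import time


class TransientModelError(Exception):
    """Hypothetical stand-in for a recoverable model-side failure."""


def process_with_retry(process, segment, max_attempts=3, base_delay=0.0):
    """Call process(segment), retrying on TransientModelError.

    Returns (result, attempts_used). Re-raises the error if the final
    attempt also fails. Delay doubles after each failed attempt.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return process(segment), attempt
        except TransientModelError:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))


def effective_completion_rate(outcomes):
    """Fraction of segments that eventually succeeded.

    outcomes is a list of booleans, one per segment, recorded after
    all retries for that segment have finished.
    """
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```

For example, if a segment fails once and succeeds on the second attempt, it counts as completed, so the effective rate can be higher than the first-attempt success rate.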


