End-to-End Evaluation and Governance of an EHR-Embedded AI Agent for Clinicians

arXiv cs.AI / 5/1/2026


Key Points

  • The paper argues that clinical AI needs continuous governance beyond point-in-time evaluation, including ongoing monitoring, re-evaluation, and iterative improvement during deployment.
  • It proposes an end-to-end governance framework combining rubric validation, live deployment feedback, technical performance monitoring, cost tracking, and gated experimentation for system changes.
  • The framework was applied to Hyperscribe, an EHR-embedded agent that turns ambient audio into structured chart updates. For this evaluation, 20 clinicians authored 1,646 validated rubrics across 823 cases.
  • Controlled experiments across seven Hyperscribe versions improved median evaluation scores from 84% to 95%. Across 107 live feedback entries collected over three months, the mix shifted from 79% error reports and 14% positive observations to 30% errors and 45% positive observations as engineering interventions resolved failures.
  • Operational performance was strong, with a median processing time of 8.1 seconds per audio segment and a 99.6% effective completion rate thanks to retry mechanisms handling transient model errors.
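
The retry behaviour behind the "effective completion rate" in the last point can be illustrated with a minimal sketch. The paper summary does not describe how Hyperscribe implements retries, so everything here is an assumption made for illustration: the `TransientModelError` type, the `run_with_retries` helper, the limit of three attempts, and the simulated tasks are all hypothetical.

```python
from typing import Callable, List


class TransientModelError(Exception):
    """Hypothetical error type standing in for a recoverable model failure."""


def run_with_retries(task: Callable[[], str], max_attempts: int = 3) -> bool:
    """Return True if the task completes within max_attempts tries."""
    for _ in range(max_attempts):
        try:
            task()
            return True
        except TransientModelError:
            continue  # transient failure: try again
    return False


def effective_completion_rate(outcomes: List[bool]) -> float:
    """Fraction of segments that eventually completed, counting retries."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def make_flaky_task(failures_before_success: int) -> Callable[[], str]:
    """Build a simulated task that fails a fixed number of times, then succeeds."""
    state = {"remaining": failures_before_success}

    def task() -> str:
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise TransientModelError("simulated transient error")
        return "ok"

    return task


# Four simulated segments that fail 0, 1, 2, and 5 times before succeeding.
outcomes = [run_with_retries(make_flaky_task(n)) for n in (0, 1, 2, 5)]
rate = effective_completion_rate(outcomes)
print(outcomes, rate)
```

In this toy run the first three segments complete within three attempts and the fourth does not, so the effective rate is 3 of 4. The point of the metric is that a segment counts as completed if any attempt succeeds, which is why it can sit well above the first-attempt success rate.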

Abstract

Clinical AI systems require not just point-in-time evaluation but continuous governance: the ongoing practice of monitoring, evaluating, iterating, and re-evaluating performance throughout deployment. We present an end-to-end framework of governance that integrates rubric validation, live deployment feedback, technical performance monitoring, and cost tracking, with controlled experimentation gating system changes before deployment. Applied to Hyperscribe, an EHR-embedded agent that converts ambient audio into structured chart updates, twenty clinicians authored 1,646 validated rubrics across 823 cases. Seven Hyperscribe versions were evaluated through controlled experiments, with median scores improving from 84% to 95%. Analysis of 107 live feedback entries over three months showed feedback composition shifting from 79% error reports and 14% positive observations to 30% errors and 45% positive observations as engineering interventions resolved failures. Median processing time per audio segment was 8.1 seconds with a 99.6% effective completion rate after retry mechanisms absorbed transient model errors. These results demonstrate that continuous, multi-channel governance of deployed clinical AI is both achievable and effective.
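
The idea of controlled experiments gating system changes can be sketched as a simple comparison of median rubric scores between a baseline version and a candidate version. This is only one plausible gating rule, not the one the authors report using; the `passes_gate` function, the `min_gain` threshold, and the score lists are hypothetical and are not data from the paper.

```python
from statistics import median
from typing import Sequence


def passes_gate(
    baseline: Sequence[float],
    candidate: Sequence[float],
    min_gain: float = 0.0,
) -> bool:
    """Allow deployment only if the candidate's median score beats the
    baseline's median by at least min_gain (an assumed rule)."""
    return median(candidate) - median(baseline) >= min_gain


# Invented per-case rubric scores for two versions (fractions of rubric met).
baseline_scores = [0.80, 0.84, 0.90]
candidate_scores = [0.92, 0.95, 0.97]

if passes_gate(baseline_scores, candidate_scores):
    print("candidate may be deployed")
else:
    print("candidate is held back")
```

A real gate would likely also account for sample size and per-rubric regressions rather than a single summary statistic; the median is used here only because it is the statistic the abstract reports.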