Perfecting Human-AI Interaction at Clinical Scale: Turning Production Signals into Safer, More Human Conversations

arXiv cs.CL / 4/1/2026


Key Points

  • The paper argues that healthcare conversational AI should be optimized for real patient interactions (imperfect audio, indirect intent, mid-call language shifts, and compliance-critical delivery), not only for benchmark accuracy.
  • It presents a production-validated framework using live signals from 115M+ patient-AI interactions plus clinician-led testing with 7K+ clinicians and 500K+ test calls to surface real-world failure modes.
  • The authors identify actionable “interaction intelligence” cues—paralinguistics, turn-taking, clarification triggers, escalation markers, multilingual continuity, and workflow confirmations—that curated datasets can miss.
  • It emphasizes that healthcare-grade safety may require multi-LLM redundancy via governed orchestration and independent checks, plus vertical integration across ASR, clarification/repair, ambient speech, and latency-aware model/hardware choices (see the sketch after this list).
  • Reported deployment results, across 10M+ real patient calls, claim a Polaris clinical safety score of 99.9%, improved patient experience (average patient rating of 8.95), and a 50% reduction in ASR errors relative to enterprise ASR systems.
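
To make the governed-orchestration point above concrete, here is a minimal sketch in which a primary model drafts a patient-facing reply and an independent checker model audits it before delivery, failing closed to human handoff. All names here (the `LLM` protocol, `generate`, the policy prompt) are illustrative assumptions; the paper does not publish Polaris's actual API.

```python
# Minimal sketch of governed multi-LLM orchestration with an independent
# safety check. Hypothetical interfaces: `LLM.generate` and the policy
# prompt are assumptions, not the Polaris API described in the paper.
from dataclasses import dataclass
from typing import Protocol


class LLM(Protocol):
    def generate(self, prompt: str) -> str: ...


@dataclass
class CheckResult:
    safe: bool
    reason: str


def independent_check(checker: LLM, reply: str, policy: str) -> CheckResult:
    """A second model, prompted only with the policy and the draft reply,
    audits the draft instead of trusting the primary model's self-judgment."""
    verdict = checker.generate(
        f"Policy:\n{policy}\n\nDraft reply:\n{reply}\n\n"
        "Answer 'SAFE' or 'UNSAFE: <reason>'."
    )
    return CheckResult(safe=verdict.strip().upper().startswith("SAFE"),
                       reason=verdict)


def respond(primary: LLM, checker: LLM, history: list[str], policy: str) -> str:
    """Draft with the primary agent; deliver only if the checker approves."""
    draft = primary.generate("\n".join(history))
    if independent_check(checker, draft, policy).safe:
        return draft
    # Fail closed: never deliver an unverified reply; hand off to a human.
    return "Let me connect you with a member of our care team for that."
```

The design point is that the checker sees only the policy and the draft, so its verdict is independent of the primary model's dialogue context; real deployments would add further verification layers on top of this pattern.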

Abstract

In a production-first regime, healthcare conversational AI agents should not be optimized only for clean benchmark accuracy; they must be optimized for the lived reality of patient conversations, where audio is imperfect, intent is indirect, language shifts mid-call, and compliance hinges on how guidance is delivered. We present a production-validated framework grounded in real-time signals from 115M+ live patient-AI interactions and clinician-led testing (7K+ licensed clinicians; 500K+ test calls). These in-the-wild cues -- paralinguistics, turn-taking dynamics, clarification triggers, escalation markers, multilingual continuity, and workflow confirmations -- reveal failure modes that curated data misses and provide actionable training and evaluation signals for safety and reliability. We further show why healthcare-grade safety cannot rely on a single LLM: long-horizon dialogue and limited attention demand redundancy via governed orchestration, independent checks, and verification. Many apparent "reasoning" errors originate upstream, motivating vertical integration across contextual ASR, clarification/repair, ambient speech handling, and latency-aware model/hardware choices. By treating interaction intelligence (tone, pacing, empathy, clarification, turn-taking) as a set of first-class safety variables, we achieve measurable gains in safety, documentation, task completion, and equity while building the safest generative AI solution for autonomous patient-facing care. Deployed across more than 10 million real patient calls, Polaris attains a clinical safety score of 99.9%, significantly improves patient experience (average patient rating of 8.95), and reduces ASR errors by 50% relative to enterprise ASR. These results establish real-world interaction intelligence as a critical -- and previously underexplored -- determinant of safety and reliability in patient-facing clinical AI systems.
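
As a rough illustration of what "interaction intelligence as first-class safety variables" could look like in code, the sketch below derives per-turn risk cues (low ASR confidence, hesitation, mid-call language shifts, repeated clarification requests) and combines them into an escalation decision. The `Turn` fields, thresholds, and keyword heuristics are assumptions for demonstration, not the paper's implementation.

```python
# Illustrative sketch (not from the paper) of treating per-turn interaction
# signals as first-class safety variables. Field names, thresholds, and
# keyword heuristics are assumptions for demonstration only.
from dataclasses import dataclass


@dataclass
class Turn:
    text: str                # ASR transcript of the patient's turn
    asr_confidence: float    # recognizer confidence in [0, 1]
    silence_before_s: float  # pause (seconds) preceding this turn
    language: str            # detected language code, e.g. "en", "es"


def interaction_flags(turns: list[Turn]) -> dict[str, bool]:
    """Derive coarse risk cues mirroring the paper's signal categories:
    clarification triggers, hesitation, and multilingual continuity."""
    languages = {t.language for t in turns}
    clarifications = sum(
        any(k in t.text.lower() for k in ("repeat", "what?", "didn't catch"))
        for t in turns
    )
    return {
        "low_asr_confidence": any(t.asr_confidence < 0.6 for t in turns),
        "long_hesitation": any(t.silence_before_s > 4.0 for t in turns),
        "mid_call_language_shift": len(languages) > 1,
        "repeated_clarification": clarifications >= 2,
    }


def should_escalate(flags: dict[str, bool]) -> bool:
    # Escalate to a human reviewer when multiple risk cues co-occur.
    return sum(flags.values()) >= 2
```

A production system would replace the keyword heuristics with learned detectors and route these flags into governed escalation policies and offline evaluation; the point of the sketch is only the shape of the signal pipeline.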
