We Built a Voice AI Receptionist in 8 Weeks — Every Decision We Made and Why

Dev.to / 4/27/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Industry & Market Moves · Models & Research

Key Points

  • Autor built a production voice AI receptionist for Canadian dental and healthcare clinics in just eight weeks, enabling 24/7 automated handling of live patient calls.
  • The team benchmarked multiple speech-to-text and text-to-speech providers, selecting Deepgram for low-latency, high-accuracy Canadian English transcription and ElevenLabs for near-human, non-robotic TTS.
  • For telephony and audio streaming, Autor chose Twilio’s Media Streams API to reliably stream bidirectional audio via WebSockets and simplify handling edge cases.
  • In selecting the conversation “brain,” Autor prioritized a model that could manage patient intent, scheduling, insurance Q&A, and escalation to humans, ultimately favoring Anthropic Claude over GPT-4.

Eight weeks. That's how long it took our team at Autor to go from "we should build a voice AI receptionist for healthcare clinics" to handling live patient calls 24/7. Not a demo. Not a proof of concept. A production system that now processes thousands of automated calls per month for dental and healthcare clients across Canada. Here's every technical and business decision we made along the way, and the reasoning behind each one.

The Starting Point

A dental clinic in Ontario came to us with a problem we'd heard a dozen times: they were losing patients because nobody answered the phone after hours. Their staff spent 3+ hours per day on calls that followed the same script — confirming appointments, answering insurance questions, routing urgent calls. They wanted automation, but every off-the-shelf solution they'd tried sounded robotic and confused patients.

We'd already built 40+ AI products at that point. We knew the gap between a voice AI demo and a voice AI that handles real calls from real patients who are sometimes anxious, sometimes angry, and sometimes just confused. We scoped 8 weeks and got to work.

Week 1–2: Choosing the Voice Stack

The first decision was the hardest: which speech-to-text and text-to-speech providers to use. We benchmarked four STT options and three TTS options over two weeks.

Speech-to-text: Deepgram won. We tested Deepgram, Google Cloud Speech, AWS Transcribe, and Whisper (self-hosted). Deepgram gave us the best combination of latency and accuracy for Canadian English with diverse accents. Our benchmarks showed Deepgram averaged 180ms first-byte latency versus 320ms for Google Cloud Speech. For a phone conversation, that 140ms difference is the gap between natural and awkward. Whisper was the most accurate but unusable for real-time — even on a GPU instance, streaming latency was 400ms+.

Text-to-speech: ElevenLabs. We needed a voice that didn't trigger the "I'm talking to a robot" response. ElevenLabs' Turbo v2 model gave us near-human quality at 150ms latency. We tested with 30 real patients in a blind study — 22 of them didn't realize they were talking to AI until we told them.

Telephony: Twilio. This wasn't a hard choice. Twilio's Media Streams API gave us bidirectional audio streaming over WebSocket. We'd used it before, understood the edge cases, and knew the Canadian number provisioning was solid. We briefly considered Vonage but their WebSocket implementation had reliability issues in our testing.
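
For reference, here's a trimmed sketch of what a Media Streams handler looks like on our side of the WebSocket, using the ws package. The STT hookup and variable names are illustrative, not production code:

```typescript
// Minimal Media Streams handler sketch using the `ws` package.
// Twilio sends JSON frames; inbound caller audio arrives as base64-encoded
// 8 kHz mu-law in `media.payload`. The STT hookup is a placeholder.
import { WebSocketServer, WebSocket } from "ws";

const wss = new WebSocketServer({ port: 8080, path: "/media" });

wss.on("connection", (twilioWs: WebSocket) => {
  let streamSid = "";

  twilioWs.on("message", (raw) => {
    const msg = JSON.parse(raw.toString());

    switch (msg.event) {
      case "start": {
        streamSid = msg.start.streamSid; // needed to address outbound audio
        break;
      }
      case "media": {
        const audioChunk = Buffer.from(msg.media.payload, "base64");
        // forward audioChunk to the STT provider's streaming connection
        break;
      }
      case "stop": {
        // call ended: flush transcripts, close downstream connections
        break;
      }
    }
  });

  // Called by the TTS side of the pipeline: outbound speech goes back to the
  // caller as base64 mu-law wrapped in a "media" event for this stream.
  const sendAudio = (mulawChunk: Buffer) => {
    twilioWs.send(
      JSON.stringify({
        event: "media",
        streamSid,
        media: { payload: mulawChunk.toString("base64") },
      })
    );
  };
});
```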

Week 3–4: The Brain — Why We Chose Anthropic Claude Over GPT-4

This is where it gets interesting. We needed a language model that could handle the core conversation logic — understanding patient intent, managing appointment scheduling, handling insurance questions, and knowing when to transfer to a human.

We ran both GPT-4 and Anthropic Claude through 200 simulated patient conversations. The results surprised us.

Claude was better at saying "I don't know." In healthcare, making something up is worse than admitting uncertainty. When we threw edge cases at both models — rare insurance scenarios, questions about specific procedures the clinic didn't offer — GPT-4 was more likely to confabulate a plausible-sounding answer. Claude was more likely to say it wasn't sure and offer to connect the patient with staff. For a healthcare application, that behavior is worth its weight in gold.

Claude's instruction-following was more consistent. We needed the model to stay strictly within its role as a receptionist. No medical advice, ever. No promises about pricing without checking the database. After prompt engineering both models for a week, Claude held its boundaries more reliably across 1,000+ test conversations.

We built the conversation engine on Claude with structured tool use. The model calls functions to check appointment availability, look up patient records, and route calls — all through a clean tool-use interface rather than string parsing.
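
Here's a simplified sketch of that pattern with the @anthropic-ai/sdk package. The tool names, schemas, and model ID are illustrative placeholders, not our actual definitions:

```typescript
// Tool-use sketch: the model requests a function call, we run it, and the
// result goes back as a tool_result block. All names here are illustrative.
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

const tools: Anthropic.Tool[] = [
  {
    name: "check_appointment_availability",
    description: "Look up open appointment slots for a given date range.",
    input_schema: {
      type: "object",
      properties: {
        start_date: { type: "string", description: "ISO date" },
        end_date: { type: "string" },
      },
      required: ["start_date", "end_date"],
    },
  },
  {
    name: "transfer_to_staff",
    description: "Hand the call off to a human when unsure or when asked.",
    input_schema: { type: "object", properties: {} },
  },
];

async function nextTurn(history: Anthropic.MessageParam[]) {
  const response = await anthropic.messages.create({
    model: "claude-3-5-sonnet-latest", // placeholder model ID
    max_tokens: 512,
    system: "You are a dental clinic receptionist. Never give medical advice.",
    tools,
    messages: history,
  });

  for (const block of response.content) {
    if (block.type === "tool_use") {
      // run the named function with block.input, then append a tool_result
      // block to the history and call the model again
    }
  }
  return response;
}
```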

Week 5: The Integration Layer

Week 5 was all plumbing. We built the integration layer that connects the voice AI to the clinic's actual systems.

Architecture: NestJS backend on AWS ECS. We chose NestJS because our team thinks in TypeScript and NestJS gives us dependency injection and module structure without the bloat. The service runs on ECS Fargate — we didn't want to manage servers, and Fargate's auto-scaling handles call volume spikes at 9am when every patient calls at once.
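
In rough outline, the layering looks something like this (provider and module names are illustrative, not our actual code):

```typescript
// NestJS module sketch: per-call session state behind one injectable,
// the telephony gateway behind another.
import { Module, Injectable } from "@nestjs/common";

@Injectable()
export class CallSessionService {
  // Holds per-call state: transcript so far, conversation history,
  // confidence scores, and which clinic's config applies.
}

@Injectable()
export class TelephonyGatewayService {
  constructor(private readonly sessions: CallSessionService) {}
  // Accepts Twilio Media Stream connections and feeds audio into the pipeline.
}

@Module({
  providers: [CallSessionService, TelephonyGatewayService],
})
export class VoiceModule {}
```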

Database: PostgreSQL with Prisma ORM. Every call gets logged — full transcript, intent classification, actions taken, duration, and outcome. This data is what lets the system improve over time. We chose Prisma because the type safety between our TypeScript code and the database eliminated an entire class of bugs we'd dealt with in past projects using raw SQL.
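
The write itself is small with Prisma; the callLog model and its fields below are illustrative assumptions, not our actual schema:

```typescript
// Illustrative per-call logging write with the Prisma client.
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

async function logCall(params: {
  clinicId: string;
  transcript: string;
  intent: string;
  actionsTaken: string[];
  durationSeconds: number;
  outcome: "handled" | "transferred" | "abandoned";
}) {
  await prisma.callLog.create({
    data: {
      clinicId: params.clinicId,
      transcript: params.transcript,
      intent: params.intent,
      actionsTaken: params.actionsTaken,
      durationSeconds: params.durationSeconds,
      outcome: params.outcome,
    },
  });
}
```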

CRM integration: HubSpot and Salesforce. Most clinics use one or the other. We built adapters for both so the AI receptionist can pull patient history and push call summaries. The HubSpot integration took 3 days. Salesforce took 8. If you've worked with the Salesforce API, you know why.
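
What kept it manageable was a shared adapter interface with a HubSpot and a Salesforce implementation behind it. A sketch, with all names illustrative:

```typescript
// Shared CRM adapter shape: the voice pipeline only talks to CrmAdapter,
// never to a specific vendor API.
interface PatientRecord {
  id: string;
  name: string;
  phone: string;
  lastVisit?: Date;
}

interface CrmAdapter {
  findPatientByPhone(phone: string): Promise<PatientRecord | null>;
  pushCallSummary(patientId: string, summary: string): Promise<void>;
}

class HubSpotAdapter implements CrmAdapter {
  async findPatientByPhone(phone: string): Promise<PatientRecord | null> {
    // search HubSpot contacts by phone number
    return null;
  }
  async pushCallSummary(patientId: string, summary: string): Promise<void> {
    // attach a call note to the contact
  }
}

class SalesforceAdapter implements CrmAdapter {
  async findPatientByPhone(phone: string): Promise<PatientRecord | null> {
    // query Contact records by phone number
    return null;
  }
  async pushCallSummary(patientId: string, summary: string): Promise<void> {
    // insert a Task or custom call-log record
  }
}
```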

Week 6: Latency Optimization — The Make-or-Break Week

Week 6 nearly broke us. Our end-to-end latency — from when the patient stops speaking to when the AI starts responding — was averaging 2.1 seconds. That's unacceptable for a phone conversation. Anything over 1.2 seconds and callers start saying "hello?" again.

Here's what we did to get it under 800ms:

Streaming everything. We switched from waiting for complete STT transcription to streaming partial results. As soon as Deepgram gives us a stable partial transcript, we start sending it to Claude. Claude streams its response back, and we start TTS on the first sentence while Claude is still generating the rest. This pipelining cut 600ms off the round trip.
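
In simplified TypeScript, the sentence-level pipelining looks roughly like this; the speak callback is a stand-in for the TTS streaming call:

```typescript
// Cut the streamed LLM output on sentence boundaries and hand each sentence
// to TTS as soon as it completes, instead of waiting for the full response.
async function pipelineResponse(
  llmStream: AsyncIterable<string>,           // streamed text deltas
  speak: (sentence: string) => Promise<void>  // placeholder TTS call
) {
  let buffer = "";
  const sentenceEnd = /([.!?])\s/;

  for await (const delta of llmStream) {
    buffer += delta;
    let match: RegExpExecArray | null;
    // Flush every complete sentence as soon as it appears.
    while ((match = sentenceEnd.exec(buffer)) !== null) {
      const cut = match.index + match[1].length;
      const sentence = buffer.slice(0, cut).trim();
      buffer = buffer.slice(cut);
      if (sentence) await speak(sentence);
    }
  }
  if (buffer.trim()) await speak(buffer.trim()); // trailing fragment
}
```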

Prompt caching. Claude's prompt caching feature was a massive win. Our system prompt is about 2,000 tokens — clinic-specific information, conversation rules, available tools. With prompt caching, that system prompt gets processed once and reused across turns. This alone saved 200-300ms per turn.
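
The change is a few lines with the @anthropic-ai/sdk package; this sketch marks the system prompt as a cache boundary (model ID and prompt contents are placeholders):

```typescript
// Prompt caching sketch: the large, stable system prompt is flagged with
// cache_control so it is processed once and reused on later turns.
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

async function cachedTurn(
  history: Anthropic.MessageParam[],
  clinicPrompt: string // ~2,000 tokens of clinic rules, tools, policies
) {
  return anthropic.messages.create({
    model: "claude-3-5-sonnet-latest", // placeholder model ID
    max_tokens: 512,
    system: [
      {
        type: "text",
        text: clinicPrompt,
        cache_control: { type: "ephemeral" }, // cache boundary
      },
    ],
    messages: history,
  });
}
```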

Connection pooling and keep-alive. We keep persistent WebSocket connections to Twilio, persistent HTTP/2 connections to Claude's API, and persistent connections to Deepgram. Cold-starting any of these adds 100-200ms.
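
For the plain HTTPS connections, that comes down to a keep-alive agent. Whether a given SDK or client accepts a custom agent varies, so treat this as illustrative:

```typescript
// Keep-alive sketch using Node's built-in https.Agent.
import https from "node:https";

export const keepAliveAgent = new https.Agent({
  keepAlive: true,        // reuse TCP/TLS connections between requests
  maxSockets: 50,         // cap concurrent connections per host
  keepAliveMsecs: 30_000, // initial delay for TCP keep-alive probes on idle sockets
});

// e.g. axios.create({ httpsAgent: keepAliveAgent }), or passing the agent to
// an SDK that exposes an agent option.
```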

After a week of optimization, we hit a median response time of 740ms. The 95th percentile was 1.1 seconds. Patients stopped noticing the delay.

Week 7: Edge Cases and Failure Modes

We spent all of week 7 on what happens when things go wrong. This is the week that separates a demo from a product.

What if the patient speaks a language the AI doesn't handle? We built language detection into the first 3 seconds of the call. If we detect French, Mandarin, or Cantonese (the three most common non-English languages for our Ontario clinics), we route immediately to a human or play a pre-recorded message in that language.
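
The routing itself is trivial once you have a detector; detectLanguage below is a hypothetical stand-in for whatever runs on those first seconds of audio:

```typescript
// Route non-English callers to a human (or a pre-recorded message) based on
// an early language check. Language codes and the detector are illustrative.
type DetectedLanguage = "en" | "fr" | "zh" | "yue" | "unknown";

const HUMAN_ROUTE_LANGUAGES: DetectedLanguage[] = ["fr", "zh", "yue"];

async function routeByLanguage(
  firstSecondsAudio: Buffer,
  detectLanguage: (audio: Buffer) => Promise<DetectedLanguage>, // hypothetical
  transferToHuman: () => Promise<void>,
  continueWithAi: () => Promise<void>
) {
  const lang = await detectLanguage(firstSecondsAudio);
  if (HUMAN_ROUTE_LANGUAGES.includes(lang)) {
    await transferToHuman(); // or play a pre-recorded message in that language
  } else {
    await continueWithAi();
  }
}
```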

What if the AI gets confused? We built a confidence scoring system. If Claude's response confidence drops below our threshold for two consecutive turns, the system says "Let me connect you with our team" and transfers the call. No patient should ever be stuck in a loop with a confused AI.
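
The rule is tiny; the threshold value and scoring method here are illustrative:

```typescript
// Two-strikes rule: two consecutive low-confidence turns trigger a transfer.
const CONFIDENCE_THRESHOLD = 0.6; // illustrative value

class ConfidenceGuard {
  private lowTurns = 0;

  /** Returns true when the call should be handed to a human. */
  shouldTransfer(turnConfidence: number): boolean {
    if (turnConfidence < CONFIDENCE_THRESHOLD) {
      this.lowTurns += 1;
    } else {
      this.lowTurns = 0; // a confident turn resets the counter
    }
    return this.lowTurns >= 2;
  }
}
```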

What if Deepgram or Claude goes down? Circuit breakers on every external dependency. If STT fails, we fall back to a DTMF menu ("press 1 for appointments"). If the LLM fails, we route to voicemail with a text notification to staff. We tested every failure mode by literally killing services in production during low-traffic hours.
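
A minimal breaker looks something like this; the thresholds are illustrative, and the fallback is whatever the caller wires in (a DTMF menu for STT, voicemail for the LLM):

```typescript
// Minimal circuit breaker for one external dependency (STT, LLM, or TTS).
class CircuitBreaker<T> {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly maxFailures = 3,
    private readonly resetAfterMs = 30_000
  ) {}

  async call(fn: () => Promise<T>, fallback: () => Promise<T>): Promise<T> {
    const open =
      this.failures >= this.maxFailures &&
      Date.now() - this.openedAt < this.resetAfterMs;
    if (open) return fallback(); // short-circuit while the breaker is open

    try {
      const result = await fn();
      this.failures = 0; // success closes the breaker
      return result;
    } catch {
      this.failures += 1;
      if (this.failures >= this.maxFailures) this.openedAt = Date.now();
      return fallback();
    }
  }
}

// e.g. sttBreaker.call(() => transcribe(chunk), () => playDtmfMenu())
```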

What about PIPEDA compliance? This is Canada — we have to handle patient data under PIPEDA, not HIPAA. All call recordings are encrypted at rest and in transit. Transcripts are stored in Canadian data centers. We built a data retention policy that auto-deletes recordings after the clinic's specified retention period. We worked with a privacy consultant to ensure our consent flow at the start of each call met PIPEDA requirements.
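
The retention piece is a scheduled job; the model and field names below are assumptions for illustration, not our actual schema:

```typescript
// Nightly purge of recordings past the clinic's retention window, via Prisma.
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

async function purgeExpiredRecordings(clinicId: string, retentionDays: number) {
  const cutoff = new Date(Date.now() - retentionDays * 24 * 60 * 60 * 1000);
  await prisma.callRecording.deleteMany({
    where: { clinicId, createdAt: { lt: cutoff } },
  });
}
```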

Week 8: Launch and the First 1,000 Calls

We launched on a Thursday at 5pm — right when the clinic closed. The AI receptionist would handle all after-hours calls for the weekend as a soft launch.

The first weekend: 127 calls. 89 handled completely by AI. 24 transferred to the on-call number (correctly — these were urgent or complex). 14 hung up before the AI could help. That's a 70% full-automation rate on day one.

Within the first month, the automation rate climbed to 82% as we tuned prompts based on real call data. The clinic saved an estimated 45 staff-hours per month. More importantly, they stopped losing after-hours patients to competitors who answered their phones.

What We'd Do Differently

Start with latency budgets, not features. We should have set our 800ms latency target on day one and built every component to that budget. Instead, we built features first and then scrambled in week 6 to optimize. The architecture would have been cleaner if latency had been a first-class constraint from the start.

Build the monitoring dashboard earlier. We didn't have real-time call monitoring until week 7. That meant weeks 5 and 6 were partially blind. Now, every Loquent deployment gets a monitoring dashboard on day one.

Test with real patients sooner. Our simulated conversations, no matter how good, didn't capture the way real patients talk on the phone. They pause mid-sentence. They talk to someone else in the room. They put the phone down and come back. We caught these patterns in week 8 when we should have been finding them in week 4.

Key Takeaways

  1. Latency is the product. In voice AI, response time determines whether your system feels like a helpful receptionist or an annoying robot. Budget for it from day one.

  2. Pick models for their failure modes, not their best cases. Claude won over GPT-4 not because it was smarter, but because it failed more gracefully — admitting uncertainty instead of making things up.

  3. Healthcare voice AI in Canada is viable right now. PIPEDA compliance is manageable, Canadian data residency options exist for all major cloud providers, and patients are more accepting of AI receptionists than most people assume.

  4. The 80/20 rule applies hard. Getting to 80% automation was 3 weeks of work. Getting from 80% to 82% was another 4 weeks of prompt tuning and edge case handling. Plan your timeline accordingly.

  5. Build the transfer path first. The AI knowing when to hand off to a human is more important than handling every scenario. A graceful transfer builds trust; a confused AI destroys it.

This is the system we turned into Loquent, our production voice AI platform. It now serves multiple healthcare and dental clients across Canada, handling thousands of calls per month.

If you're building something similar, we'd love to hear about it. Reach out at hello@autor.ca or visit autor.ca.