Voice AI just had its "ChatGPT moment."
A year ago, building a voice agent meant stitching together five different APIs and paying multiple vendors per minute of conversation. Today the open-source ecosystem has genuinely caught up - and it's moving fast.
I've been deep in this rabbit hole building Dograh, an open-source voice agent platform in the spirit of n8n. This post is basically the research I wish existed when I started. Here's the full OSS stack - from raw audio all the way to a deployed phone agent.
The Stack at a Glance
A production voice agent has five layers:
Telephony / Transport -> Twilio, Vonage, WebRTC
STT (Speech-to-Text) -> Parakeet, Canary Qwen, Silero VAD
LLM -> GPT-4o, Claude, Llama 3
TTS (Text-to-Speech) -> Chatterbox, Kokoro, XTTS-v2
Orchestration -> Dograh, Pipecat, LiveKit Agents
Every single layer now has solid open-source options. Let's go through them one by one.
Speech-to-Text
If you're building anything real-time, you want something designed for streaming from the ground up. Whisper was the breakthrough model of 2022 and still holds up surprisingly well for the voice agent use case - Whisper Turbo is worth trying before any of the alternatives below.
The best option right now for English real-time transcription is NVIDIA's Parakeet TDT 0.6B V2. It sits at #3 on the Hugging Face Open ASR leaderboard with a 6.05% WER, but the number that actually matters for voice agents is its RTFx score of 3386 - meaning it can process audio roughly 3000x faster than real-time. On a T4 GPU it's extremely affordable to run. It handles punctuation, capitalization, and word-level timestamps out of the box.
Python
import nemo.collections.asr as nemo_asr

# Downloads the checkpoint from Hugging Face on first run
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")

# Offline transcription of a 16kHz mono WAV file
transcript = model.transcribe(["audio.wav"])
print(transcript[0])
If accuracy matters more than raw speed - say you're transcribing medical calls or anything where a wrong word is costly - NVIDIA's Canary Qwen 2.5B is the current accuracy leader on the Open ASR leaderboard at 5.63% WER. It combines ASR with LLM capabilities under the hood, which helps a lot with context and unusual vocabulary. The tradeoff is it's heavier to run and not as snappy for real-time use.
Either way, pair your STT model with Silero VAD. It's a small Voice Activity Detection model that tells your agent when someone is actually speaking. Without it you're either cutting people off mid-sentence or waiting awkwardly for them to finish. Every real-time voice pipeline needs this.
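To make "when someone is actually speaking" concrete, here's a minimal endpointing sketch - my own hand-rolled illustration, not Silero's actual API. A VAD like Silero emits a speech probability per short audio chunk (roughly 30ms each), and the agent decides the turn is over once enough consecutive silent chunks pile up:

```python
def end_of_turn(speech_probs, threshold=0.5, silence_chunks=25):
    """Return the chunk index where the user's turn ends, or None.

    speech_probs: per-chunk speech probabilities from a VAD (~30ms/chunk).
    silence_chunks=25 at ~30ms/chunk is roughly 750ms of trailing silence.
    """
    silent = 0
    speaking = False
    for i, p in enumerate(speech_probs):
        if p >= threshold:
            speaking = True   # user has started talking
            silent = 0        # any speech resets the silence counter
        elif speaking:
            silent += 1
            if silent >= silence_chunks:
                return i      # hand the utterance to STT/LLM here
    return None               # caller is still talking (or never started)

# 40 chunks of speech followed by a long stretch of silence
probs = [0.9] * 40 + [0.05] * 30
print(end_of_turn(probs))
```

Tune `silence_chunks` carefully: too low and you cut people off mid-thought, too high and the agent feels sluggish.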
Text-to-Speech
Chatterbox from Resemble AI is the most exciting TTS release of the past year. It hits commercial-grade quality in blind tests, supports voice cloning, and has built-in audio watermarking for responsible use. If you're building anything customer-facing, this is probably your best open-source option right now.
Python
import torchaudio
from chatterbox.tts import ChatterboxTTS

# Downloads the model weights on first run
model = ChatterboxTTS.from_pretrained(device="cuda")
wav = model.generate("Hello! This is Chatterbox speaking.")
torchaudio.save("output.wav", wav, model.sr)  # model.sr is the output sample rate
For multilingual voice cloning, XTTS-v2 from Coqui is the go-to. It supports zero-shot cloning across 20+ languages with just a 6-second reference clip. Works well for audiobook tools, multilingual assistants, and anything where you need a consistent voice across languages.
If latency is your main constraint, look at Kokoro. It's only 82M parameters, runs on CPU, and can hit under 100ms on consumer hardware. The quality isn't Chatterbox-level but for edge deployments or high-throughput scenarios it's hard to beat.
Orchestration
This is the layer most developers underestimate. Orchestration ties STT, LLM, and TTS together and handles all the hard real-time stuff - barge-in when the user interrupts, turn detection, audio streaming, silence handling. Getting this wrong is what makes voice agents feel robotic even when the individual models are great.
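As a rough illustration of what "getting this right" involves, here's a toy barge-in handler - my own sketch, not any framework's API. The moment the VAD reports user speech while the agent is talking, the pipeline has to cancel TTS playback, flush unspoken LLM output, and flip back to listening:

```python
class TurnState:
    """Toy barge-in logic: the agent speaks until the user interrupts."""

    def __init__(self):
        self.agent_speaking = False
        self.events = []  # actions a real pipeline would execute

    def start_agent_turn(self):
        self.agent_speaking = True
        self.events.append("tts_start")

    def on_vad(self, user_is_speaking):
        # Barge-in: the user started talking over the agent
        if user_is_speaking and self.agent_speaking:
            self.agent_speaking = False
            self.events.append("tts_cancel")  # stop audio immediately
            self.events.append("llm_flush")   # drop unspoken LLM output
            self.events.append("stt_listen")  # open a new user turn

state = TurnState()
state.start_agent_turn()
state.on_vad(user_is_speaking=True)
print(state.events)  # ['tts_start', 'tts_cancel', 'llm_flush', 'stt_listen']
```

Real orchestrators do exactly this dance, just asynchronously and with audio buffers in flight - which is why the cancel path is where most homegrown pipelines fall apart.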
Dograh is what I've been building. Think of it as the n8n of voice AI - a visual workflow builder where you can wire up your entire agent flow without touching code. It's the most direct open-source alternative to Vapi and Retell, and unlike those it's fully self-hostable with no per-minute markup.
It's pretty mature at this point. You get telephony out of the box via Twilio, Vonage, and Cloudonix, inbound and outbound calling, the visual workflow builder, and one-command setup. All the plumbing is already there - knowledge base, dictionary, KYC, voicemail detection, variable extraction, call QA, multilingual support, transfer-to-human. It handles both inbound and outbound calls and can also be deployed as a widget on your website. You just bring your own LLM, STT, and TTS keys and plug them in.
curl -fsSL https://raw.githubusercontent.com/dograh-hq/dograh/main/install.sh | bash
The visual workflow builder is the big differentiator from raw frameworks like Pipecat or LiveKit Agents. Changing agent behavior means dragging a node, not editing Python and redeploying. For teams that want to iterate fast on agent logic without a full engineering cycle every time, that's a pretty big deal.
Pipecat is a Python framework built by Daily.co. It treats audio as a stream of typed frames and lets you build a pipeline of processors in sequence. It's transport-agnostic and gives you fine-grained control over every step. That control comes at a cost though - every time you want to change agent behavior, you're editing Python code, redeploying, and hoping nothing broke in the pipeline. For a team without dedicated voice engineering experience, the iteration loop gets slow fast.
Python
pipeline = Pipeline([
    transport.input(),   # raw audio in from the transport (e.g. WebRTC)
    silero_vad,          # voice activity detection
    deepgram_stt,        # speech-to-text
    openai_llm,          # response generation
    cartesia_tts,        # text-to-speech
    transport.output(),  # audio back out to the caller
])
LiveKit/Agents has a cleaner API and abstracts away the WebRTC infrastructure, which makes the initial setup quicker. But the same problem applies - your agent logic lives in code. Any prompt change, flow tweak, or new use case means a code change and a redeploy. It's genuinely a good framework if you have engineers who live in this stuff full-time, but it's not something a small team can move fast with.
Python
agent = VoicePipelineAgent(
    vad=silero.VAD.load(),  # turn detection
    stt=deepgram.STT(),     # speech-to-text
    llm=openai.LLM(),       # response generation
    tts=cartesia.TTS(),     # text-to-speech
)
Both Pipecat and LiveKit Agents are solid if you want maximum control and have the engineering bandwidth to match. If you don't, you'll spend more time maintaining infrastructure than actually improving your agent.
Quick Comparison
| Feature | Dograh | Pipecat | LiveKit Agents |
|---|---|---|---|
| Type | Platform | Framework | Framework |
| Visual Workflow Builder | Yes | No | No |
| Frontend UI | Yes | No | No |
| Telephony | Twilio, Vonage, Cloudonix | Twilio | SIP |
| Self-hostable | Yes | Yes | Yes |
| Setup Time | Minutes | Hours | Hours |
| Bring Your Own LLM | Yes | Yes | Yes |
| Open-source | Yes | Yes | Yes |
The Mistake Most People Make
The biggest trap I see developers fall into when building voice agents: treating voice like chat with a microphone attached.
It's a completely different problem. The hard parts aren't the models - they're the real-time engineering. When do you cut off the STT and start the LLM? What happens when the user interrupts the agent mid-sentence? How do you handle answering machines on outbound calls? And what about codec mismatches? PSTN phone lines deliver 8kHz u-law audio while most STT models expect 16kHz PCM. These are the things that will actually bite you in production.
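That last codec mismatch bites almost everyone, so here's what the fix looks like in pure Python - a didactic sketch (in production you'd use a proper resampling library): decode G.711 u-law bytes to signed 16-bit PCM, then upsample 8kHz to 16kHz by linear interpolation.

```python
def ulaw_to_pcm16(byte):
    """Decode one G.711 u-law byte to a signed 16-bit PCM sample."""
    byte = ~byte & 0xFF              # u-law bytes are stored complemented
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample

def upsample_2x(samples):
    """Naive 8kHz -> 16kHz: interleave linearly interpolated samples."""
    out = []
    for i, a in enumerate(samples):
        b = samples[i + 1] if i + 1 < len(samples) else a
        out.append(a)
        out.append((a + b) // 2)     # midpoint between neighboring samples
    return out

pcm8k = [ulaw_to_pcm16(b) for b in bytes([0x00, 0x80, 0xFF])]
pcm16k = upsample_2x(pcm8k)
print(len(pcm8k), len(pcm16k))  # 3 6
```

Linear interpolation is crude (a polyphase filter sounds better), but it's enough to keep an STT model from choking on 8kHz input.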
If you're just starting out, prototype with Dograh, Vapi, or Retell. They're all fast and handle a lot of edge cases well. But once you hit serious volume or need custom logic, the open-source stack should be your default - I'm obviously biased toward Dograh here - and the cost difference is real. Running your own stack costs under $0.02 per minute. Managed platforms charge $0.10 to $0.15.
A Starter Stack That's Completely Free
If you want to get something running this weekend with zero API bills:
VAD - Silero VAD
STT - Parakeet TDT 0.6B V2 running locally
LLM - Llama 3.1 via Groq's free tier
TTS - Kokoro running locally
Orchestration - Dograh
Total infra cost - basically zero. Realistic latency - under 500ms end-to-end is achievable with a mid-range GPU.
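For a sanity check on that 500ms figure, here's the back-of-envelope budget I use per conversational turn. Every per-stage number below is an illustrative assumption, not a measured benchmark:

```python
# Rough per-stage latency budget for one conversational turn (milliseconds).
# All figures are illustrative assumptions, not measured benchmarks.
budget_ms = {
    "vad_endpointing": 150,   # trailing silence before we call the turn done
    "stt_final": 50,          # Parakeet finalizing the transcript
    "llm_first_token": 200,   # time to first token from the LLM
    "tts_first_audio": 80,    # Kokoro producing the first audio chunk
    "network_transport": 20,  # WebRTC / telephony hops
}

total = sum(budget_ms.values())
print(f"end-to-end: {total}ms")  # end-to-end: 500ms
```

The takeaway: the LLM's time-to-first-token and your endpointing silence window dominate the budget, so those are the two knobs to tune first.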
What Are You Building?
Curious what stacks people are actually running in production. Is anyone using Kokoro for real-time agents? The latency numbers look great on paper but I haven't seen many production writeups.
Drop your stack in the comments.
I'm building Dograh - an open-source alternative to Vapi and Retell. If you're tired of vendor lock-in, check it out and star it if it's useful.