Voice AI just had its "ChatGPT moment."
A year ago, building a voice agent meant stitching together five different APIs and paying multiple vendors per minute of conversation. Today the open-source ecosystem has genuinely caught up - and it's moving fast.
I've been deep in this rabbit hole building Dograh, an open-source voice agent platform in the spirit of n8n. This post is basically the research I wish existed when I started. Here's the full OSS stack - from raw audio all the way to a deployed phone agent.
The Stack at a Glance
A production voice agent has five layers:
Telephony / Transport -> Twilio, Vonage, WebRTC
STT (Speech-to-Text) -> Parakeet, Canary Qwen, Silero VAD
LLM -> GPT-4o, Claude, Llama 3
TTS (Text-to-Speech) -> Chatterbox, Kokoro, XTTS-v2
Orchestration -> Dograh, Pipecat, LiveKit Agents
Every single layer now has solid open-source options. Let's go through them one by one.
Speech-to-Text
If you're building anything real-time, you want something designed for streaming from the ground up. Whisper was the breakthrough model of 2022 and still holds up surprisingly well for the voice agent use case - Whisper Turbo is worth trying before any of the alternatives below.
The best option right now for English real-time transcription is NVIDIA's Parakeet TDT 0.6B V2. It sits at #3 on the Hugging Face Open ASR leaderboard with a 6.05% WER, but the number that actually matters for voice agents is its RTFx score of 3386 - meaning it can process audio roughly 3000x faster than real-time. On a T4 GPU it's extremely affordable to run. It handles punctuation, capitalization, and word-level timestamps out of the box.
Python
import nemo.collections.asr as nemo_asr

# Downloads the checkpoint from Hugging Face on first run
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")

# Offline transcription of a 16kHz mono WAV file
transcript = model.transcribe(["audio.wav"])
print(transcript[0])
If accuracy matters more than raw speed - say you're transcribing medical calls or anything where a wrong word is costly - NVIDIA's Canary Qwen 2.5B is the current accuracy leader on the Open ASR leaderboard at 5.63% WER. It combines ASR with LLM capabilities under the hood, which helps a lot with context and unusual vocabulary. The tradeoff is it's heavier to run and not as snappy for real-time use.
Either way, pair your STT model with Silero VAD. It's a small Voice Activity Detection model that tells your agent when someone is actually speaking. Without it you're either cutting people off mid-sentence or waiting awkwardly for them to finish. Every real-time voice pipeline needs this.
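To make "when someone is actually speaking" concrete, here's a minimal endpointing sketch - my own hand-rolled illustration, not Silero's actual API. A VAD like Silero emits a speech probability per short audio chunk (roughly 30ms each), and the agent decides the turn is over once enough consecutive silent chunks pile up:

```python
def end_of_turn(speech_probs, threshold=0.5, silence_chunks=25):
    """Return the chunk index where the user's turn ends, or None.

    speech_probs: per-chunk speech probabilities from a VAD (~30ms/chunk).
    silence_chunks=25 at ~30ms/chunk is roughly 750ms of trailing silence.
    """
    silent = 0
    speaking = False
    for i, p in enumerate(speech_probs):
        if p >= threshold:
            speaking = True   # user has started talking
            silent = 0        # any speech resets the silence counter
        elif speaking:
            silent += 1
            if silent >= silence_chunks:
                return i      # hand the utterance to STT/LLM here
    return None               # caller is still talking (or never started)

# 40 chunks of speech followed by a long stretch of silence
probs = [0.9] * 40 + [0.05] * 30
print(end_of_turn(probs))
```

Tune `silence_chunks` carefully: too low and you cut people off mid-thought, too high and the agent feels sluggish.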
Text-to-Speech
Chatterbox from Resemble AI is the most exciting TTS release of the past year. It hits commercial-grade quality in blind tests, supports voice cloning, and has built-in audio watermarking for responsible use. If you're building anything customer-facing, this is probably your best open-source option right now.
Python
import torchaudio
from chatterbox.tts import ChatterboxTTS

# Downloads the model weights on first run
model = ChatterboxTTS.from_pretrained(device="cuda")
wav = model.generate("Hello! This is Chatterbox speaking.")
torchaudio.save("output.wav", wav, model.sr)  # model.sr is the output sample rate
For multilingual voice cloning, XTTS-v2 from Coqui is the go-to. It supports zero-shot cloning across 20+ languages with just a 6-second reference clip. Works well for audiobook tools, multilingual assistants, and anything where you need a consistent voice across languages.
If latency is your main constraint, look at Kokoro. It's only 82M parameters, runs on CPU, and can hit under 100ms on consumer hardware. The quality isn't Chatterbox-level but for edge deployments or high-throughput scenarios it's hard to beat.
Orchestration
This is the layer most developers underestimate. Orchestration ties STT, LLM, and TTS together and handles all the hard real-time stuff - barge-in when the user interrupts, turn detection, audio streaming, silence handling. Getting this wrong is what makes voice agents feel robotic even when the individual models are great.
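As a rough illustration of what "getting this right" involves, here's a toy barge-in handler - my own sketch, not any framework's API. The moment the VAD reports user speech while the agent is talking, the pipeline has to cancel TTS playback, flush unspoken LLM output, and flip back to listening:

```python
class TurnState:
    """Toy barge-in logic: the agent speaks until the user interrupts."""

    def __init__(self):
        self.agent_speaking = False
        self.events = []  # actions a real pipeline would execute

    def start_agent_turn(self):
        self.agent_speaking = True
        self.events.append("tts_start")

    def on_vad(self, user_is_speaking):
        # Barge-in: the user started talking over the agent
        if user_is_speaking and self.agent_speaking:
            self.agent_speaking = False
            self.events.append("tts_cancel")  # stop audio immediately
            self.events.append("llm_flush")   # drop unspoken LLM output
            self.events.append("stt_listen")  # open a new user turn

state = TurnState()
state.start_agent_turn()
state.on_vad(user_is_speaking=True)
print(state.events)  # ['tts_start', 'tts_cancel', 'llm_flush', 'stt_listen']
```

Real orchestrators do exactly this dance, just asynchronously and with audio buffers in flight - which is why the cancel path is where most homegrown pipelines fall apart.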
Dograh is what I've been building. Think of it as the n8n of voice AI - a visual workflow builder where you can wire up your entire agent flow without touching code. It's the most direct open-source alternative to Vapi and Retell, and unlike those it's fully self-hostable with no per-minute markup.
It's pretty mature at this point. You get telephony out of the box via Twilio, Vonage, and Cloudonix, inbound and outbound calling, the visual workflow builder, and one-command setup. All the plumbing is already there - knowledge base, dictionary, KYC, voicemail detection, variable extraction, call QA, multilingual support, transfer-to-human. It handles both inbound and outbound calls and can also be deployed as a widget on your website. You just bring your own LLM, STT, and TTS keys and plug them in.
curl -fsSL https://raw.githubusercontent.com/dograh-hq/dograh/main/install.sh | bash
The visual workflow builder is the big differentiator from raw frameworks like Pipecat or LiveKit Agents. Changing agent behavior means dragging a node, not editing Python and redeploying. For teams that want to iterate fast on agent logic without a full engineering cycle every time, that's a pretty big deal.
Pipecat is a Python framework built by Daily.co. It treats audio as a stream of typed frames and lets you build a pipeline of processors in sequence. It's transport-agnostic and gives you fine-grained control over every step. That control comes at a cost though - every time you want to change agent behavior, you're editing Python code, redeploying, and hoping nothing broke in the pipeline. For a team without dedicated voice engineering experience, the iteration loop gets slow fast.
Python
pipeline = Pipeline([
    transport.input(),   # raw audio in from the transport (e.g. WebRTC)
    silero_vad,          # voice activity detection
    deepgram_stt,        # speech-to-text
    openai_llm,          # response generation
    cartesia_tts,        # text-to-speech
    transport.output(),  # audio back out to the caller
])
LiveKit/Agents has a cleaner API and abstracts away the WebRTC infrastructure, which makes the initial setup quicker. But the same problem applies - your agent logic lives in code. Any prompt change, flow tweak, or new use case means a code change and a redeploy. It's genuinely a good framework if you have engineers who live in this stuff full-time, but it's not something a small team can move fast with.
Python
agent = VoicePipelineAgent(
    vad=silero.VAD.load(),  # turn detection
    stt=deepgram.STT(),     # speech-to-text
    llm=openai.LLM(),       # response generation
    tts=cartesia.TTS(),     # text-to-speech
)
Both Pipecat and LiveKit Agents are solid if you want maximum control and have the engineering bandwidth to match. If you don't, you'll spend more time maintaining infrastructure than actually improving your agent.
Quick Comparison
| Feature | Dograh | Pipecat | LiveKit Agents |
|---|---|---|---|
| Type | Platform | Framework | Framework |
| Visual Workflow Builder | Yes | No | No |
| Frontend UI | Yes | No | No |
| Telephony | Twilio, Vonage, Cloudonix | Twilio | SIP |
| Self-hostable | Yes | Yes | Yes |
| Setup Time | Minutes | Hours | Hours |
| Bring Your Own LLM | Yes | Yes | Yes |
| Open-source | Yes | Yes | Yes |
The Mistake Most People Make
The biggest trap I see developers fall into when building voice agents: treating voice like chat with a microphone attached.
It's a completely different problem. The hard parts aren't the models - they're the real-time engineering. When do you cut off the STT and start the LLM? What happens when the user interrupts the agent mid-sentence? How do you handle answering machines on outbound calls? And what about codec mismatches? PSTN phone lines deliver 8kHz u-law audio while most STT models expect 16kHz PCM. These are the things that will actually bite you in production.
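That last codec mismatch bites almost everyone, so here's what the fix looks like in pure Python - a didactic sketch (in production you'd use a proper resampling library): decode G.711 u-law bytes to signed 16-bit PCM, then upsample 8kHz to 16kHz by linear interpolation.

```python
def ulaw_to_pcm16(byte):
    """Decode one G.711 u-law byte to a signed 16-bit PCM sample."""
    byte = ~byte & 0xFF              # u-law bytes are stored complemented
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample

def upsample_2x(samples):
    """Naive 8kHz -> 16kHz: interleave linearly interpolated samples."""
    out = []
    for i, a in enumerate(samples):
        b = samples[i + 1] if i + 1 < len(samples) else a
        out.append(a)
        out.append((a + b) // 2)     # midpoint between neighboring samples
    return out

pcm8k = [ulaw_to_pcm16(b) for b in bytes([0x00, 0x80, 0xFF])]
pcm16k = upsample_2x(pcm8k)
print(len(pcm8k), len(pcm16k))  # 3 6
```

Linear interpolation is crude (a polyphase filter sounds better), but it's enough to keep an STT model from choking on 8kHz input.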
If you're just starting out, prototype with Dograh, Vapi, or Retell. They're all fast and handle a lot of edge cases well. But once you hit serious volume or need custom logic, the open-source stack should be your default - I'm obviously biased toward Dograh here - and the cost difference is real. Running your own stack costs under $0.02 per minute. Managed platforms charge $0.10 to $0.15.
A Starter Stack That's Completely Free
If you want to get something running this weekend with zero API bills:
VAD - Silero VAD
STT - Parakeet TDT 0.6B V2 running locally
LLM - Llama 3.1 via Groq's free tier
TTS - Kokoro running locally
Orchestration - Dograh
Total infra cost - basically zero. Realistic latency - under 500ms end-to-end is achievable with a mid-range GPU.
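For a sanity check on that 500ms figure, here's the back-of-envelope budget I use per conversational turn. Every per-stage number below is an illustrative assumption, not a measured benchmark:

```python
# Rough per-stage latency budget for one conversational turn (milliseconds).
# All figures are illustrative assumptions, not measured benchmarks.
budget_ms = {
    "vad_endpointing": 150,   # trailing silence before we call the turn done
    "stt_final": 50,          # Parakeet finalizing the transcript
    "llm_first_token": 200,   # time to first token from the LLM
    "tts_first_audio": 80,    # Kokoro producing the first audio chunk
    "network_transport": 20,  # WebRTC / telephony hops
}

total = sum(budget_ms.values())
print(f"end-to-end: {total}ms")  # end-to-end: 500ms
```

The takeaway: the LLM's time-to-first-token and your endpointing silence window dominate the budget, so those are the two knobs to tune first.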
What Are You Building?
Curious what stacks people are actually running in production. Is anyone using Kokoro for real-time agents? The latency numbers look great on paper but I haven't seen many production writeups.
Drop your stack in the comments.
I'm building Dograh - an open-source alternative to Vapi and Retell. If you're tired of vendor lock-in, check it out and star it if it's useful.