Built a Voice Agents from Scratch GitHub tutorial: mic > Whisper > local LLM (GGUF) > Kokoro > speaker, fully local, no API keys

Reddit r/LocalLLaMA / 5/4/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The GitHub tutorial repository “voice-agents-from-scratch” presents a real-time, end-to-end fully local voice agent pipeline: microphone capture → Whisper STT → local GGUF LLM (via llama.cpp) → Kokoro TTS → speaker output.
  • It emphasizes streaming throughout the pipeline, so speech output can start as soon as partial LLM results are available, making the interaction feel conversational rather than chatbot-like.
  • The repo is organized as chapter-by-chapter scripts (Audio IO, STT, TTS, full voice loop, real-time systems, tools, personality, projects) with brief CODE.md walkthroughs and a small shared library to show how components compose.
  • The author highlights that running everything locally helps expose real latency sources (warm-up time, first-audio time, and streaming chunk size) instead of hiding them behind abstractions.
  • The author plans an additional deployment chapter (possibly using modal.com) and shares that they initially considered Node.js but found the ecosystem lacking for Whisper support and general audio processing.

Been building this for a while and finally cleaned it up enough to share.

voice-agents-from-scratch is a numbered, chapter-by-chapter repo that walks the full real-time pipeline:

  • Microphone capture
  • Whisper for STT
  • Local GGUF LLM (via llama.cpp)
  • Kokoro for TTS
  • Speaker output
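The repo's actual stage code isn't reproduced here, but the way such stages compose can be sketched with stdlib queues and threads. The real capture, STT, LLM, and TTS calls (sounddevice, Whisper, llama.cpp, Kokoro) are replaced with stub functions; every name below is illustrative, not from the repo.

```python
import queue
import threading

def stage(fn, inbox, outbox):
    """Run one pipeline stage: read items, transform, forward. None = shutdown."""
    while (item := inbox.get()) is not None:
        outbox.put(fn(item))
    outbox.put(None)  # propagate shutdown downstream

# Stubs standing in for the real components (Whisper, llama.cpp, Kokoro).
transcribe = lambda audio: f"text({audio})"
generate   = lambda text: f"reply({text})"
synthesize = lambda text: f"audio({text})"

mic_q, stt_q, llm_q, tts_q = (queue.Queue() for _ in range(4))
threads = [
    threading.Thread(target=stage, args=(fn, q_in, q_out))
    for fn, q_in, q_out in [
        (transcribe, mic_q, stt_q),
        (generate,   stt_q, llm_q),
        (synthesize, llm_q, tts_q),
    ]
]
for t in threads:
    t.start()

mic_q.put("chunk1")   # pretend microphone frames
mic_q.put(None)       # end of input
outputs = []
while (out := tts_q.get()) is not None:
    outputs.append(out)   # would be played through the speaker
for t in threads:
    t.join()
print(outputs)
```

Because each stage owns its own thread and talks only through queues, a slow stage backs up its inbox instead of blocking the microphone, which is the same decoupling the real pipeline needs.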

Everything streams - you don't wait for the full LLM response before TTS starts speaking. That's the part that makes it feel like a real conversation instead of a chatbot with a voice skin.
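One way to picture the streaming handoff is a sentence-level chunker: accumulate streamed LLM tokens and hand each complete sentence to TTS the moment its boundary arrives, instead of waiting for the whole reply. This is a minimal sketch of that idea, not the repo's code; `stream_sentences` and the boundary regex are assumptions.

```python
import re
from typing import Iterable, Iterator

# Sentence-ish boundaries; real systems often also split on commas
# or pauses to shave off even more first-audio latency.
_BOUNDARY = re.compile(r"([.!?])\s")

def stream_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Accumulate streamed LLM tokens and yield each complete sentence
    as soon as its boundary arrives, so TTS can start speaking early."""
    buffer = ""
    for token in tokens:
        buffer += token
        while (match := _BOUNDARY.search(buffer)):
            end = match.end(1)
            yield buffer[:end].strip()
            buffer = buffer[end:]
    if buffer.strip():
        yield buffer.strip()  # flush whatever trails the last boundary

# Tokens as an LLM might stream them:
tokens = ["Hello", " there", ". How", " are", " you? ", "Fine."]
chunks = list(stream_sentences(tokens))
print(chunks)
```

Each yielded chunk can be fed to TTS immediately, so the first sentence is audible while the model is still generating the rest.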

Chapters:

  1. Intro
  2. Audio IO
  3. Speech to Text (STT)
  4. Text to Speech (TTS)
  5. Full voice loop
  6. Real time systems
  7. Tools
  8. Personality
  9. Projects

Each chapter is a runnable script + a short CODE.md walkthrough. There's also a small shared library so you can see how the pieces compose into a real system, not just isolated calls.

Why fully local matters here: you can actually see where latency lives. Warm-up, first-audio time, streaming chunk size - these aren't abstractions when you're running it on your own machine.
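Measuring those latency sources needs nothing fancy; a sketch of the kind of instrumentation involved, with artificial `time.sleep` delays standing in for real model work (the `LatencyProbe` name and stub stages are illustrative, not from the repo):

```python
import time

class LatencyProbe:
    """Record wall-clock durations of named pipeline steps."""
    def __init__(self):
        self.timings = {}

    def measure(self, label, fn, *args):
        start = time.perf_counter()
        result = fn(*args)
        self.timings[label] = time.perf_counter() - start
        return result

probe = LatencyProbe()
# Stub stages with artificial delays standing in for real model work.
probe.measure("warm_up", lambda: time.sleep(0.05))  # model load / first inference
first = probe.measure("first_audio", lambda: (time.sleep(0.02), b"pcm")[1])

for label, seconds in probe.timings.items():
    print(f"{label}: {seconds * 1000:.1f} ms")
```

Wrapping each stage like this makes warm-up cost and time-to-first-audio separate, visible numbers rather than one blurred delay.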

I'm planning a deployment chapter, probably using modal.com for it - wishes and suggestions are welcome.

Repo: https://github.com/pguso/voice-agents-from-scratch

I originally wanted to publish this repo using Node.js, but that ecosystem really isn't ready. There is a very good Kokoro-JS npm package, but for Whisper support and audio processing in general there are no good options.

Happy to answer questions about the architecture or tradeoffs I ran into.

submitted by /u/purellmagents