Built a Voice Agents from Scratch GitHub tutorial: mic > Whisper > local LLM (GGUF) > Kokoro > speaker, fully local, no API keys

Reddit r/LocalLLaMA / 5/4/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The GitHub tutorial repository “voice-agents-from-scratch” presents a real-time, end-to-end fully local voice agent pipeline: microphone capture → Whisper STT → local GGUF LLM (via llama.cpp) → Kokoro TTS → speaker output.
  • It emphasizes streaming throughout the pipeline, so speech output can start as soon as partial LLM results are available, making the interaction feel conversational rather than chatbot-like.
  • The repo is organized as chapter-by-chapter scripts (Audio IO, STT, TTS, full voice loop, real-time systems, tools, personality, projects) with brief CODE.md walkthroughs and a small shared library to show how components compose.
  • The author highlights that running everything locally helps expose real latency sources (warm-up time, first-audio time, and streaming chunk size) instead of hiding them behind abstractions.
  • The author plans an additional deployment chapter (possibly using modal.com) and shares that they initially considered Node.js but found the ecosystem lacking for Whisper support and general audio processing.

Been building this for a while and finally cleaned it up enough to share.

voice-agents-from-scratch is a numbered, chapter-by-chapter repo that walks the full real-time pipeline:

  • Microphone capture
  • Whisper for STT
  • Local GGUF LLM (via llama.cpp)
  • Kokoro for TTS
  • Speaker output
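The repo's actual stage code isn't reproduced here, but the way such stages compose can be sketched with stdlib queues and threads. The real capture, STT, LLM, and TTS calls (sounddevice, Whisper, llama.cpp, Kokoro) are replaced with stub functions; every name below is illustrative, not from the repo.

```python
import queue
import threading

def stage(fn, inbox, outbox):
    """Run one pipeline stage: read items, transform, forward. None = shutdown."""
    while (item := inbox.get()) is not None:
        outbox.put(fn(item))
    outbox.put(None)  # propagate shutdown downstream

# Stubs standing in for the real components (Whisper, llama.cpp, Kokoro).
transcribe = lambda audio: f"text({audio})"
generate   = lambda text: f"reply({text})"
synthesize = lambda text: f"audio({text})"

mic_q, stt_q, llm_q, tts_q = (queue.Queue() for _ in range(4))
threads = [
    threading.Thread(target=stage, args=(fn, q_in, q_out))
    for fn, q_in, q_out in [
        (transcribe, mic_q, stt_q),
        (generate,   stt_q, llm_q),
        (synthesize, llm_q, tts_q),
    ]
]
for t in threads:
    t.start()

mic_q.put("chunk1")   # pretend microphone frames
mic_q.put(None)       # end of input
outputs = []
while (out := tts_q.get()) is not None:
    outputs.append(out)   # would be played through the speaker
for t in threads:
    t.join()
print(outputs)
```

Because each stage owns its own thread and talks only through queues, a slow stage backs up its inbox instead of blocking the microphone, which is the same decoupling the real pipeline needs.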

Everything streams - you don't wait for the full LLM response before TTS starts speaking. That's the part that makes it feel like a real conversation instead of a chatbot with a voice skin.
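One way to picture the streaming handoff is a sentence-level chunker: accumulate streamed LLM tokens and hand each complete sentence to TTS the moment its boundary arrives, instead of waiting for the whole reply. This is a minimal sketch of that idea, not the repo's code; `stream_sentences` and the boundary regex are assumptions.

```python
import re
from typing import Iterable, Iterator

# Sentence-ish boundaries; real systems often also split on commas
# or pauses to shave off even more first-audio latency.
_BOUNDARY = re.compile(r"([.!?])\s")

def stream_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Accumulate streamed LLM tokens and yield each complete sentence
    as soon as its boundary arrives, so TTS can start speaking early."""
    buffer = ""
    for token in tokens:
        buffer += token
        while (match := _BOUNDARY.search(buffer)):
            end = match.end(1)
            yield buffer[:end].strip()
            buffer = buffer[end:]
    if buffer.strip():
        yield buffer.strip()  # flush whatever trails the last boundary

# Tokens as an LLM might stream them:
tokens = ["Hello", " there", ". How", " are", " you? ", "Fine."]
chunks = list(stream_sentences(tokens))
print(chunks)
```

Each yielded chunk can be fed to TTS immediately, so the first sentence is audible while the model is still generating the rest.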

Chapters:

  1. Intro
  2. Audio IO
  3. Speech to Text (STT)
  4. Text to Speech (TTS)
  5. Full voice loop
  6. Real time systems
  7. Tools
  8. Personality
  9. Projects

Each chapter is a runnable script + a short CODE.md walkthrough. There's also a small shared library so you can see how the pieces compose into a real system, not just isolated calls.

Why fully local matters here: you can actually see where latency lives. Warm-up, first-audio time, streaming chunk size - these aren't abstractions when you're running it on your own machine.
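Measuring those latency sources needs nothing fancy; a sketch of the kind of instrumentation involved, with artificial `time.sleep` delays standing in for real model work (the `LatencyProbe` name and stub stages are illustrative, not from the repo):

```python
import time

class LatencyProbe:
    """Record wall-clock durations of named pipeline steps."""
    def __init__(self):
        self.timings = {}

    def measure(self, label, fn, *args):
        start = time.perf_counter()
        result = fn(*args)
        self.timings[label] = time.perf_counter() - start
        return result

probe = LatencyProbe()
# Stub stages with artificial delays standing in for real model work.
probe.measure("warm_up", lambda: time.sleep(0.05))  # model load / first inference
first = probe.measure("first_audio", lambda: (time.sleep(0.02), b"pcm")[1])

for label, seconds in probe.timings.items():
    print(f"{label}: {seconds * 1000:.1f} ms")
```

Wrapping each stage like this makes warm-up cost and time-to-first-audio separate, visible numbers rather than one blurred delay.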

I'm planning a deployment chapter, probably using modal.com for it - wishes and suggestions are welcome.

Repo: https://github.com/pguso/voice-agents-from-scratch

I originally wanted to publish this repo using Node.js, but that ecosystem really isn't ready. There is a very good Kokoro-JS npm package, but for Whisper support and audio processing in general there are no good options.

Happy to answer questions about the architecture or tradeoffs I ran into.

submitted by /u/purellmagents