I built a sovereign voice layer that routes to 11 AI providers — here's the architecture

Dev.to / 4/30/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • The author describes BRAGI, a locally running “sovereign voice layer” that uses a wake word, local speech-to-text, and then routes the resulting text to one of multiple AI providers (cloud or local).
  • The architecture keeps audio from leaving the machine, while only the transcribed text is sent to the selected provider, enabling easier switching and avoiding permanent lock-in to a single assistant.
  • BRAGI’s pipeline combines openwakeword for CPU-based wake detection, faster-whisper (medium on CUDA) for low-latency bilingual STT, and a TTS stage using eSpeak or OpenAI Nova (BYOK).
  • The post highlights practical engineering lessons from shipping v0.2, including wake-word model compatibility issues across openwakeword preprocessors and the need to warm up Whisper once at startup to avoid unacceptable per-request latency.
  • A key challenge was building a provider router that normalizes differing SDKs, streaming formats, and authentication schemes behind a single interface with readiness checks.

After two years of bouncing between Claude Desktop, ChatGPT voice, Gemini, and a half-dozen Ollama frontends, I got tired of the wake-word thrash. Every assistant assumes you've picked their team forever.

So I built BRAGI — a voice layer that runs locally, listens locally, and routes to whichever AI I tell it to. Including the one running on the same machine.

This post is the architecture, not a sales pitch. If you've been thinking about building something similar, here's what I learned shipping v0.2.

The pipeline

Mic input
  ↓
openwakeword (local) — "Hey Jarvis"
  ↓
faster-whisper medium (local, GPU optional)
  ↓
Provider router (settings UI picks destination)
  ↓
[Cloud: Claude / OpenAI / Gemini / Grok / Groq / Together / HuggingFace]
[Local: Ollama / LM Studio / FREYA / Echo]
  ↓
TTS (eSpeak free, OpenAI Nova BYOK)
  ↓
Speaker output

Audio never leaves the machine. Only transcribed text goes to whichever cloud you picked, if any.

Wake word

openwakeword is the right call for a sovereign product. Picovoice is better quality but locks you into a paid commercial license. openwakeword is Apache 2.0 and runs on CPU.

The catch: training your own custom model requires matching the feature dimensions to whichever preprocessor version you're targeting. I burned half a day on a model that had 96×103 features when openwakeword expected 32×147. v0.2 ships with the stock "Hey Jarvis" model and includes the custom "Hey BRAGI" model for users with compatible hardware.
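
For scale, the detection loop itself is only a dozen lines. A minimal sketch, assuming a 16 kHz mono mic via sounddevice and the stock pre-trained models; exact model-loading arguments vary across openwakeword versions, and the 0.5 threshold is illustrative, not a tuned value:

import numpy as np
import sounddevice as sd
from openwakeword.model import Model
from openwakeword.utils import download_models

download_models()  # fetch the bundled pre-trained models on first run
# A custom model would be passed by file path instead, and its feature dims
# must match this openwakeword version's preprocessor (the 96×103 vs 32×147 trap).
oww = Model()

CHUNK = 1280  # 80 ms at 16 kHz, the frame size openwakeword recommends

with sd.InputStream(samplerate=16000, channels=1, dtype="int16",
                    blocksize=CHUNK) as mic:
    while True:
        frame, _overflowed = mic.read(CHUNK)
        scores = oww.predict(np.squeeze(frame))  # {model_name: score}
        if max(scores.values()) > 0.5:  # illustrative threshold
            print("wake!")  # hand the mic stream off to STT here
            oww.reset()     # clear internal buffers before listening again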

STT

faster-whisper medium on CUDA is the sweet spot. Tiny is too inaccurate for real conversation; large is overkill for short voice commands. Medium gets ~1 second latency on a midrange GPU and handles bilingual input out of the box.

Critical detail: instantiate Whisper once at startup, never per-request. First inference call takes 5-10 seconds to warm CUDA. Users won't tolerate that on every wake.
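
In code, the pattern is just load-once-then-warm. A sketch assuming faster-whisper's Python API; the one-second silent clip is a throwaway warm-up input:

import numpy as np
from faster_whisper import WhisperModel

# Load once at startup, never per request.
stt = WhisperModel("medium", device="cuda", compute_type="float16")

# Warm-up: pay the 5-10 s CUDA initialization cost on a second of silence
# at boot instead of on the user's first wake.
segments, _info = stt.transcribe(np.zeros(16000, dtype=np.float32))
list(segments)  # transcribe() is lazy; iterating actually runs inference

def transcribe(audio: np.ndarray) -> str:
    """audio: 16 kHz mono float32, straight from the wake-word recorder."""
    segments, _info = stt.transcribe(audio)
    return "".join(seg.text for seg in segments)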

The router

This was the hardest part. Each provider has a different SDK, different streaming format, different auth pattern. The router abstracts that into one interface:

from typing import AsyncIterator, Protocol

class Provider(Protocol):
    # Message is the app's chat-turn type (role + content)
    def name(self) -> str: ...
    def is_ready(self) -> bool: ...
    async def respond(self, prompt: str, history: list[Message]) -> AsyncIterator[str]: ...

Each provider implementation handles its own SDK quirks. The router just picks one based on user settings or voice command ("BRAGI, switch to Claude") and calls respond().

For local models I support both Ollama (HTTP API) and LM Studio (OpenAI-compatible HTTP API). Both run on the user's machine. Both look identical to the router.
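
For a feel of what one implementation looks like, here's a minimal sketch of the local side, assuming httpx and Ollama's streaming /api/chat endpoint; the class name and defaults are mine, not BRAGI's:

import json
import httpx

class OllamaProvider:
    """Sketch only: names and defaults are illustrative."""

    def __init__(self, model: str = "llama3", base_url: str = "http://127.0.0.1:11434"):
        self.model = model
        self.base_url = base_url

    def name(self) -> str:
        return "ollama"

    def is_ready(self) -> bool:
        # Ready means the local daemon is answering at all.
        try:
            return httpx.get(f"{self.base_url}/api/tags", timeout=1.0).status_code == 200
        except httpx.HTTPError:
            return False

    async def respond(self, prompt, history):
        messages = [{"role": m.role, "content": m.content} for m in history]
        messages.append({"role": "user", "content": prompt})
        async with httpx.AsyncClient(timeout=None) as client:
            async with client.stream(
                "POST", f"{self.base_url}/api/chat",
                json={"model": self.model, "messages": messages, "stream": True},
            ) as resp:
                async for line in resp.aiter_lines():
                    if not line:
                        continue
                    chunk = json.loads(line)
                    if not chunk.get("done"):
                        yield chunk["message"]["content"]

Cloud providers wrap their own SDKs behind the same three methods, so the router never sees the difference.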

TTS

eSpeak ships with the installer because it's free, offline, and covers 100+ languages. It sounds robotic. That's fine. People who want premium voice can paste an OpenAI API key and use Nova.

I tried Kokoro for higher-quality offline TTS. It worked great in dev, but production builds kept hitting a 404 on the default voice file from HuggingFace. Shipped with eSpeak as the default and Kokoro as best-effort.
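
The dispatch itself is a few lines. A sketch of the fallback, assuming the espeak CLI is on PATH and OpenAI's speech API for the Nova path (playback wiring elided):

import subprocess
from openai import OpenAI

def speak(text: str, api_key: str | None) -> None:
    if api_key:
        # Premium path: OpenAI TTS with the Nova voice (BYOK).
        resp = OpenAI(api_key=api_key).audio.speech.create(
            model="tts-1", voice="nova", input=text)
        resp.write_to_file("reply.mp3")  # then play via the audio backend
    else:
        # Default path: free, offline, robotic.
        subprocess.run(["espeak", text], check=False)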

The settings UI

Local web UI at http://127.0.0.1:7777. Configure providers, paste API keys, pick voices, manage your license. The page lives on the user's machine. No account, no login, no cloud dashboard.

API keys live in a local vault. They never leave the machine. The product is sovereignty — that has to be true at every layer.
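
Nothing exotic behind it. A minimal sketch of the shape, assuming FastAPI plus uvicorn; the endpoint and the in-memory VAULT dict are stand-ins for BRAGI's actual routes and on-disk vault:

import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
VAULT: dict[str, str | None] = {}  # stand-in for the on-disk key vault

class ProviderConfig(BaseModel):
    provider: str               # e.g. "claude" or "ollama"
    api_key: str | None = None  # BYOK; stays local

@app.post("/settings/provider")
def set_provider(cfg: ProviderConfig):
    VAULT[cfg.provider] = cfg.api_key
    return {"ok": True}

if __name__ == "__main__":
    # The loopback bind is the point: nothing listens on the network.
    uvicorn.run(app, host="127.0.0.1", port=7777)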

Stack

  • Python 3.11
  • openwakeword for wake detection
  • faster-whisper for STT
  • eSpeak / OpenAI Nova for TTS
  • FastAPI for the local settings server
  • pythonw.exe in tray mode for daily use
  • PyInstaller for bundling
  • NSIS for the Windows installer
  • ~169MB installer, Win10/11

What I'd do differently

  1. Custom wake word training is harder than the docs admit. openwakeword's preprocessor is versioned and the feature dims have to match exactly. Document this for users who want to train their own.

  2. PyInstaller + 4GB CUDA torch builds blow past NSIS's 2GB single-file limit. I had to move torch + Kokoro to a first-run download instead of bundling them.

  3. Don't trust the embedded Python's python311._pth defaults. User-site contamination from %APPDATA%\Python will silently break your install. Always launch with the -s -E flags (see the launcher line after this list).
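
For reference, the launcher invocation; bragi here is a stand-in for the actual entry module:

REM -s: skip the user site-packages under %APPDATA%\Python
REM -E: ignore PYTHON* environment variables that could redirect imports
pythonw.exe -s -E -m bragi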

What's next

v0.3 will likely add: better Kokoro fallback, custom wake word training UI, multi-room concurrency. The architecture supports it — I just need to ship v0.2 first and see what users actually ask for.

If you want to see it: clintwave84.gumroad.com/l/leetkd

If you've built something similar and want to compare notes — drop a comment. Especially curious how others have handled the provider abstraction across cloud + local.

— Built by one guy in Idaho. Snake River AI.