I'm building a real-time speech translator (STT → LLM translation → TTS) and spent a couple of weeks benchmarking every TTS engine I could find — cloud and local — running on a MacBook Air M4 with 24GB RAM. Some findings were... not what I expected. Sharing everything because I couldn't find this data anywhere when I started.

The setup

Pipeline: Deepgram Nova-3 (STT, ~300ms) → Groq Llama 3.3 70B (translation, ~200ms) → TTS → speaker

The TTS component is the bottleneck. STT and LLM together take ~500ms. If TTS adds another second, the conversation feels like a walkie-talkie.

Local TTS benchmarks (Apple M4, warm, same phrases)
Piper is fastest but sounds like a robot from 2015 (and the project was archived in Oct 2025). Kokoro 82M is the sweet spot: A+ quality at ~370ms for short chunks. Everything above ~200M parameters is basically unusable for real-time on a Mac.

The quantization surprise (this one hurt)

Tried to speed up Kokoro on M4:
INT8 is almost 2x slower than fp16 on Apple Silicon. ARM chips are optimized for fp16 ops, so quantization saves RAM but adds type-conversion overhead. Burned a full day on this because nothing in the docs mentions it. CoreML doesn't help either: only 37 of 2,493 model nodes are supported by the CoreML EP. MLX isn't faster for short texts, either; PyTorch CPU was paradoxically faster than MLX for short phrases (98ms vs 364ms for a 6-character input) because of MLX graph-compilation overhead.

Cloud TTS: Protocol matters more than provider

This was the biggest shock. Same provider, same model, same text:
Cartesia over WebSocket vs Cartesia over its sync API: a 5.5x difference. If you're benchmarking TTS providers with their sync SDK, you're measuring the wrong thing.

The cost problem
ElevenLabs is 4-20x more expensive than alternatives of comparable quality. At 1,000 hours/month, that's a $5,310 difference.

What I ended up with

Deepgram Nova-3 → Groq Llama 3.3 70B → StreamChunker (splits into 2-3 word chunks) → Kokoro 82M

Total latency to first audio: ~870ms. For comparison, Google Meet S2ST is ~2,000ms, and Palabra.ai is ~800ms at $25+/mo. Going open-source soon; the translator runs on Elixir + Rust + Flask.

TL;DR
I wrote a longer piece with all 30+ providers, ELO rankings, and detailed per-phrase benchmarks if anyone wants the full data: https://ai.gopubby.com/i-benchmarked-30-voice-ai-engines-and-built-a-real-time-translator-faster-than-google-meet-e6a160def969

Happy to answer questions about any specific provider or setup.
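For anyone curious about the 2-3 word chunking step in the final pipeline: the author's StreamChunker lives in the Elixir/Rust side of the project, so the following is only a hypothetical Python sketch of the idea, not their code:

```python
from typing import Iterable, Iterator, List

def stream_chunks(words: Iterable[str], size: int = 3) -> Iterator[str]:
    """Group an incoming word stream into small chunks so TTS can
    start speaking before the full sentence has been translated."""
    buf: List[str] = []
    for w in words:
        buf.append(w)
        if len(buf) >= size:
            yield " ".join(buf)
            buf = []
    if buf:  # flush any trailing words
        yield " ".join(buf)

print(list(stream_chunks("the quick brown fox jumps over".split())))
# -> ['the quick brown', 'fox jumps over']
```

Smaller chunks cut time-to-first-audio but give the TTS less context for prosody, which is the tradeoff a 2-3 word chunk size is balancing.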
I benchmarked 30+ TTS engines for a real-time translator on Apple M4. Quantization made things SLOWER. Here's all the data.
Reddit r/LocalLLaMA / 4/15/2026
💬 Opinion · Ideas & Deep Analysis · Tools & Practical Usage
Key Points
- The author built a real-time speech translation pipeline (Deepgram Nova-3 STT → Groq Llama 3.3 70B translation → TTS) and found the TTS stage is the primary latency bottleneck for conversational feel.
- Benchmarks on an Apple M4 MacBook Air show Piper ryan-med is fastest but lower quality, while Kokoro 82M (fp16) delivers the best real-time tradeoff with ~370ms for short chunks and A+ quality.
- The tests indicate that TTS models larger than ~200M parameters are effectively unusable for real-time operation on the given Mac hardware.
- A key finding is that quantizing Kokoro on the M4 made performance worse (INT8: ~687ms vs fp16: ~373ms), contradicting the expectation that quantization would speed up inference.
- Some platform-specific optimizations (e.g., the CoreML Neural Engine) were not supported for the tested model architecture, and restricting inference to a single thread drastically increased latency (to ~1723ms).
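The latency arithmetic behind these points is simple enough to sanity-check. A minimal sketch — the stage timings are the post's reported numbers, while the dict keys and function name are purely illustrative:

```python
# Per-stage latencies reported in the post, in milliseconds.
STAGES_MS = {
    "stt_deepgram_nova3": 300,    # speech -> text
    "llm_groq_llama33_70b": 200,  # text -> translated text
}

def time_to_first_audio_ms(tts_first_chunk_ms: int) -> int:
    """Pipeline time-to-first-audio: STT + translation + first TTS chunk."""
    return sum(STAGES_MS.values()) + tts_first_chunk_ms

# With Kokoro 82M's ~370ms for a short chunk, the total matches the
# ~870ms end-to-end figure the post reports.
print(time_to_first_audio_ms(370))  # -> 870
```

The same function also shows why TTS dominates the budget: swap in a one-second TTS stage and the total jumps to 1500ms, well into walkie-talkie territory.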