I benchmarked 30+ TTS engines for a real-time translator on Apple M4. Quantization made things SLOWER. Here's all the data.

Reddit r/LocalLLaMA / 4/15/2026


Key Points

  • The author built a real-time speech translation pipeline (Deepgram Nova-3 STT → Groq Llama 3.3 70B translation → TTS) and found the TTS stage is the primary latency bottleneck for conversational feel.
  • Benchmarks on an Apple M4 MacBook Air show Piper ryan-med is fastest but lower quality, while Kokoro 82M (fp16) delivers the best real-time tradeoff with ~370ms for short chunks and A+ quality.
  • The tests indicate that TTS models larger than ~200M parameters are effectively unusable for real-time operation on the given Mac hardware.
  • A key finding is that quantizing Kokoro on the M4 made performance worse (INT8: ~687ms vs fp16: ~373ms), contradicting the expectation that quantization would speed up inference.
  • Some platform-specific optimizations (e.g., CoreML Neural Engine) were not supported for the tested model architecture, and using a single thread drastically increased latency (~1723ms).

I'm building a real-time speech translator (STT → LLM translation → TTS) and spent a couple weeks benchmarking every TTS engine I could find — cloud and local. Running on MacBook Air M4, 24GB RAM.

Some findings were... not what I expected. Sharing everything because I couldn't find this data anywhere when I started.

The setup

Pipeline: Deepgram Nova-3 (STT, ~300ms) → Groq Llama 3.3 70B (translation, ~200ms) → TTS → speaker

The TTS component is the bottleneck. STT and LLM together take ~500ms. If TTS adds another second, the conversation feels like a walkie-talkie.
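The budget arithmetic is simple enough to sanity-check in a few lines (stage numbers are the averages quoted above; this is an illustrative sketch, not part of the actual pipeline code):

```python
# Rough latency budget for the pipeline above (averages from the post).
STAGES_MS = {
    "stt_deepgram_nova3": 300,    # speech-to-text
    "llm_groq_llama33_70b": 200,  # translation
}

def time_to_first_audio(tts_ms: int) -> int:
    """Total latency to first audio = STT + LLM + TTS time-to-first-chunk."""
    return sum(STAGES_MS.values()) + tts_ms

print(time_to_first_audio(370))   # 870 -- with Kokoro 82M fp16 on a short chunk
print(time_to_first_audio(1000))  # 1500 -- add a 1s TTS and it's walkie-talkie territory
```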

Local TTS benchmarks (Apple M4, warm, same phrases)

| Model | Size | Latency (2-3 words) | Latency (10 words) | Quality |
|---|---|---|---|---|
| Piper ryan-med | 63MB | 30-50ms | 137ms | B |
| Kokoro 82M fp16 | 156MB | 370ms | 730ms | A+ |
| pocket-tts | 100M | 260ms | 7500ms (!) | B |
| ZipVoice | 123M | ~500ms | 1240ms | B+ |
| Chatterbox | 500M | 6310ms | 9100ms | A |
| Qwen3-TTS 0.6B | 600M | ~800ms | 1600-2000ms | B+ |
| Qwen3-TTS 1.7B | 1.7B | ~2500ms | 5300ms | A |

Piper is fastest but sounds like a robot from 2015 (and the project got archived Oct 2025). Kokoro 82M is the sweet spot — A+ quality at 370ms for short chunks.

Everything above 200M parameters is basically unusable for real-time on Mac.
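If you want to reproduce these numbers, methodology matters: warm the model first, then take the median over repeated runs of the same phrase. A minimal harness sketch — the `fake_engine` stub is a placeholder for whatever engine you're testing (a Kokoro or Piper call, for example):

```python
import time
from statistics import median

def benchmark_tts(synthesize, phrase: str, warmup: int = 2, runs: int = 10) -> float:
    """Return median latency in ms for synthesize(phrase), after warm-up runs."""
    for _ in range(warmup):            # warm-up: weight loading, JIT, caches
        synthesize(phrase)
    timings = []
    for _ in range(runs):
        t0 = time.perf_counter()
        synthesize(phrase)
        timings.append((time.perf_counter() - t0) * 1000)
    return median(timings)

# Stub engine that "synthesizes" in ~10ms, purely for illustration.
fake_engine = lambda text: time.sleep(0.01)

print(benchmark_tts(fake_engine, "hello there"))  # roughly 10ms on any machine
```

Taking the median rather than the mean keeps one GC pause or OS hiccup from skewing a 10-run sample.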

The quantization surprise (this one hurt)

Tried to speed up Kokoro on M4:

| Optimization | Result | Verdict |
|---|---|---|
| fp16 (default) | 373ms | Best |
| INT8 quantization | 687ms | 1.8x SLOWER |
| q8f16 | 655ms | 1.75x SLOWER |
| CoreML Neural Engine | error | Architecture not supported |
| 1 thread | 1723ms | |
| 4 threads | ~730ms | Optimum |
| 8 threads | 754ms | Overhead |

INT8 ends up almost 2x slower than fp16 on Apple Silicon. The M-series cores have native fp16 arithmetic, so INT8 weights get dequantized on the fly: quantization saves RAM but adds type-conversion overhead to every op. Burned a full day on this because nothing in the docs mentions it.
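The thread sweet spot is easy to pin down if you run the model through ONNX Runtime (an assumption — the Kokoro ONNX export is one common way to run the 82M model, but your stack may differ): `intra_op_num_threads` is a plain session option. A configuration sketch, with a hypothetical model filename:

```python
import onnxruntime as ort

opts = ort.SessionOptions()
opts.intra_op_num_threads = 4   # the sweet spot from the table above
opts.inter_op_num_threads = 1   # no parallel subgraphs needed for this model

# CPU EP only: per the findings above, the CoreML EP covers too few
# of this architecture's nodes to be worth enabling.
session = ort.InferenceSession(
    "kokoro-82m-fp16.onnx",     # hypothetical filename for the fp16 export
    sess_options=opts,
    providers=["CPUExecutionProvider"],
)
```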

CoreML doesn't work either — only 37 of 2493 model nodes are supported by the CoreML EP.

MLX doesn't help for short texts either. PyTorch CPU was paradoxically faster than MLX for short phrases (98ms vs 364ms for a 6-character input), because MLX's graph compilation overhead dominates at that scale.

Cloud TTS: Protocol matters more than provider

This was the biggest shock. Same provider, same model, same text:

| Provider | Protocol | TTFB avg |
|---|---|---|
| Cartesia Sonic-2 | WebSocket | 245ms |
| Cartesia Sonic-2 | sync SDK | 1361ms |
| ElevenLabs Flash v2.5 | WebSocket | 395ms |
| Hume Octave 2 | HTTP stream | 800ms |
| Hume Octave 2 | sync | 2158ms |

Cartesia WebSocket vs sync = 5.5x difference. If you're benchmarking TTS providers with their sync SDK, you're measuring the wrong thing.
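The reason is structural, not network magic: a streaming API hands you the first audio chunk while the rest is still rendering, while a sync API blocks until the whole utterance is done. A stub illustration (50ms-per-word stand-ins, no real provider involved):

```python
import time

def synth_sync(text: str) -> bytes:
    """Stub sync API: returns nothing until the WHOLE utterance is rendered."""
    time.sleep(0.05 * len(text.split()))   # pretend 50ms per word
    return b"audio"

def synth_stream(text: str):
    """Stub streaming API: yields audio word by word as it renders."""
    for _word in text.split():
        time.sleep(0.05)
        yield b"chunk"

phrase = "ten words of text to make the difference really obvious"

start = time.perf_counter()
next(synth_stream(phrase))                 # time to FIRST chunk only
stream_ttfb = (time.perf_counter() - start) * 1000

start = time.perf_counter()
synth_sync(phrase)                         # blocks for the full phrase
sync_ttfb = (time.perf_counter() - start) * 1000

print(round(stream_ttfb), round(sync_ttfb))  # roughly 50 vs 500 here
```

The gap grows with utterance length, which is exactly why sync-SDK benchmarks flatter nobody and mislead everybody.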

The cost problem

| Provider | $/hour of voice bot |
|---|---|
| Hume Octave 2 | $0.26 |
| Inworld Mini | $0.17 |
| Cartesia Sonic | $1.26 |
| OpenAI TTS-1 | $0.51 |
| ElevenLabs Flash v2.5 | $5.57 |

ElevenLabs is 4-20x more expensive than alternatives with comparable quality. At 1,000 hours/month that's a $5,310 difference.
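The monthly figure falls straight out of the per-hour table (comparing against Hume as the quality-comparable alternative):

```python
# Per-hour prices from the table above (USD per hour of generated voice).
PRICES = {
    "Hume Octave 2": 0.26,
    "Inworld Mini": 0.17,
    "Cartesia Sonic": 1.26,
    "OpenAI TTS-1": 0.51,
    "ElevenLabs Flash v2.5": 5.57,
}

hours_per_month = 1_000
delta = PRICES["ElevenLabs Flash v2.5"] - PRICES["Hume Octave 2"]
print(round(delta * hours_per_month))   # 5310 -> the $5,310/month figure
```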

What I ended up with

Deepgram Nova-3 → Groq Llama 3.3 70B → StreamChunker (splits into 2-3 word chunks) → Kokoro 82M

Total latency to first audio: ~870ms. Google Meet S2ST is ~2000ms. Palabra.ai is ~800ms at $25+/mo.
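The StreamChunker is the piece that lets Kokoro's 370ms short-chunk latency apply to whole sentences: feed it 2-3 words at a time and audio starts while the translation is still streaming in. A simplified sketch of what such a chunker could do — the real implementation presumably also handles partial tokens and language-specific break points:

```python
def chunk_stream(tokens, max_words: int = 3):
    """Greedy chunker: flush on trailing punctuation or when max_words is hit."""
    buf = []
    for tok in tokens:
        buf.append(tok)
        if len(buf) >= max_words or tok.endswith((".", ",", "?", "!")):
            yield " ".join(buf)
            buf = []
    if buf:                     # flush whatever remains at end of stream
        yield " ".join(buf)

print(list(chunk_stream("I benchmarked thirty TTS engines on a Mac".split())))
# ['I benchmarked thirty', 'TTS engines on', 'a Mac']
```

Each yielded chunk goes straight to TTS, so the speaker hears the first 2-3 words while the rest of the sentence is still being synthesized.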

Going open-source soon. The translator runs on Elixir + Rust + Flask.

TL;DR

  • Kokoro 82M fp16 is the only viable local TTS for real-time on Mac (370ms, A+ quality)
  • Don't quantize on Apple Silicon — INT8 is 1.8x slower than fp16
  • CoreML and MLX don't help for short-text TTS inference
  • Always benchmark TTS over WebSocket, not sync API (5.5x difference)
  • ElevenLabs is 4-20x overpriced vs Cartesia/Hume/Inworld
  • Every serious new open-source TTS model is 0.5B+ params — unusable for real-time on Mac

I wrote a longer piece with all 30+ providers, ELO rankings, and detailed per-phrase benchmarks if anyone wants the full data: https://ai.gopubby.com/i-benchmarked-30-voice-ai-engines-and-built-a-real-time-translator-faster-than-google-meet-e6a160def969

Happy to answer questions about any specific provider or setup.

submitted by /u/Kir_Moisha