We benchmarked two architectures against Whisper Large v3 (FP16) on LibriSpeech test-clean (2,620 utterances) as part of speech-swift, an open-source Swift library for on-device speech AI.
Results:
- Qwen3-ASR 1.7B 8-bit: 2.35% WER (vs Whisper's 2.7%) — 26% smaller, with a 13% relative reduction in WER
- Qwen3-ASR 0.6B 8-bit: 2.80% WER — 600M params, 40% of Whisper's parameter count
- Parakeet TDT INT8: 2.74% WER — 634 MB CoreML model on the Neural Engine
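The headline percentages follow directly from the WER numbers above; a quick sanity check of the relative improvement (a standalone sketch, not part of the benchmark script):

```python
# Relative WER reduction: the fraction of the baseline's errors a model removes.
def relative_wer_reduction(baseline_wer: float, model_wer: float) -> float:
    return (baseline_wer - model_wer) / baseline_wer

whisper = 2.7    # Whisper Large v3 FP16, LibriSpeech test-clean
qwen_17b = 2.35  # Qwen3-ASR 1.7B 8-bit

print(f"{relative_wer_reduction(whisper, qwen_17b):.0%}")  # prints "13%"
```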
The two models get there for different architectural reasons:
Qwen3-ASR follows the Large Audio-Language Model (LALM) paradigm — Qwen3 LLM decoder instead of Whisper's cross-attention decoder. The LLM decoder resolves acoustic ambiguity from language context rather than speech statistics alone, and is confident enough that greedy decoding matches beam search accuracy. The AuT encoder was pretrained on ~40M hours — roughly 60x Whisper's training data.
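A toy illustration of why greedy decoding can match beam search: greedy takes the argmax token at every step, and whenever one token dominates each step's distribution, every beam collapses onto that same path. The distributions below are hypothetical, not the Qwen3 decoder's:

```python
# Toy greedy decode over per-step token distributions (hypothetical values).
# If the argmax token dominates at every step, beam search finds the same path.
def greedy_decode(step_probs):
    out = []
    for probs in step_probs:  # probs: {token: probability}
        out.append(max(probs, key=probs.get))
    return out

steps = [
    {"the": 0.97, "a": 0.02, "thee": 0.01},  # language context makes "the" near-certain
    {"cat": 0.95, "cap": 0.04, "cad": 0.01},
]
print(greedy_decode(steps))  # prints "['the', 'cat']"
```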
Parakeet TDT is a non-autoregressive transducer — the TDT joint network maps encoder frames directly to tokens, with no autoregressive loop, so generative hallucination is impossible by design.
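A minimal sketch of the TDT idea: at each step the joint network emits a token plus a duration, and the decoder jumps ahead by that many encoder frames. The `joint` function below is a hypothetical lookup table standing in for the real neural joint network; the point is that output is pinned to audio frames, so the model can only label what it hears:

```python
# Toy TDT-style decode: each step yields (token, duration). Output is anchored
# to encoder frames, so the decoder cannot free-generate text.
BLANK = "<blank>"

def tdt_decode(frames, joint):
    tokens, t = [], 0
    while t < len(frames):
        token, duration = joint(frames[t])  # one prediction per visited frame
        if token != BLANK:
            tokens.append(token)
        t += max(1, duration)  # skip `duration` frames ahead
    return tokens

# Hypothetical per-frame predictions standing in for the joint network.
table = {0: ("he", 2), 2: (BLANK, 1), 3: ("llo", 3)}
print(tdt_decode(range(6), lambda f: table.get(f, (BLANK, 1))))  # prints "['he', 'llo']"
```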
Multilingual note: 4-bit quantization is catastrophic for non-English. Korean WER on FLEURS jumps from 6.89% (8-bit) to 19.95% (4-bit) — nearly a 3x increase in errors. English barely changes. If you serve non-English users, don't use 4-bit.
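The mechanism behind that cliff is easy to demonstrate: with uniform symmetric quantization, dropping from 8 to 4 bits shrinks the grid from 255 levels to 15, so rounding error grows by roughly an order of magnitude. A self-contained sketch (plain uniform quantization, not necessarily the exact scheme these models use):

```python
# Uniform symmetric quantization: snap each weight to a b-bit integer grid.
def quantize(ws, bits):
    scale = max(abs(w) for w in ws) / (2 ** (bits - 1) - 1)
    return [round(w / scale) * scale for w in ws]

def mean_abs_error(ws, bits):
    return sum(abs(w - q) for w, q in zip(ws, quantize(ws, bits))) / len(ws)

weights = [i / 1000 - 0.5 for i in range(1000)]  # stand-in weight values
e8, e4 = mean_abs_error(weights, 8), mean_abs_error(weights, 4)
print(f"4-bit rounding error is {e4 / e8:.0f}x the 8-bit error")
```

Small per-weight errors compound through the network; languages with thinner training coverage have less redundancy to absorb them, which is consistent with English holding steady while Korean collapses.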
All numbers are reproducible: the benchmark script is in the repo and takes about 15 minutes on an M2 Max.
Article (architecture breakdown + full benchmarks): https://blog.ivan.digital/we-beat-whisper-large-v3-with-a-600m-model-running-entirely-on-your-mac-20e6ce191174
Library: github.com/soniqo/speech-swift




