
[P] Quantized on-device models beat Whisper Large v3 (FP16) — LALM vs transducer, 15k inference tests, fully reproducible

Reddit r/MachineLearning / 3/21/2026

📰 News · Tools & Practical Usage · Models & Research

Key Points

  • The results come from speech-swift, an open-source library for on-device speech AI, which benchmarks on-device models against Whisper Large v3 (FP16) on LibriSpeech test-clean; the workflow is fully reproducible and runs in about 15 minutes on an M2 Max.
  • Qwen3-ASR 1.7B 8-bit follows the Large Audio-Language Model (LALM) paradigm, using an LLM decoder that resolves acoustic ambiguity from language context; it achieves 2.35% WER, beating Whisper's 2.7% while being about 26% smaller and roughly 13% more accurate in relative terms.
  • Qwen3-ASR 0.6B 8-bit achieves 2.80% WER with 600M parameters, about 40% of Whisper's parameter count.
  • Parakeet TDT INT8 achieves 2.74% WER with a 634 MB CoreML model running on the Apple Neural Engine, using a transducer whose TDT joint network maps encoder frames directly to tokens with no autoregressive decoding loop.
  • A multilingual caveat: 4-bit quantization is catastrophic for non-English languages (e.g., Korean WER jumps from 6.89% with 8-bit to 19.95% with 4-bit), so avoid 4-bit for non-English deployments.

We benchmarked two architectures against Whisper Large v3 (FP16) on LibriSpeech test-clean (2,620 utterances) as part of speech-swift, an open-source Swift library for on-device speech AI.

Results:

- Qwen3-ASR 1.7B 8-bit: 2.35% WER (vs Whisper's 2.7%) — 26% smaller, 13% more accurate

- Qwen3-ASR 0.6B 8-bit: 2.80% WER — 600M params, 40% of Whisper's parameter count

- Parakeet TDT INT8: 2.74% WER — 634 MB CoreML model on the Neural Engine
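For reference, WER here is the standard word-level metric: Levenshtein edit distance between the reference and hypothesis word sequences, divided by the reference word count. A minimal sketch (not the library's actual scorer):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between ref[:i-1] and hyp[:j], rolled row by row
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (r != h)))    # substitution
        prev = cur
    return prev[-1] / len(ref)

# One substitution ("the" -> "a") over 6 reference words
print(f"{wer('the cat sat on the mat', 'the cat sat on a mat'):.4f}")  # prints 0.1667
```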

Two different architectural reasons:

  1. Qwen3-ASR follows the Large Audio-Language Model (LALM) paradigm — Qwen3 LLM decoder instead of Whisper's cross-attention decoder. The LLM decoder resolves acoustic ambiguity from language context rather than speech statistics alone, and is confident enough that greedy decoding matches beam search accuracy. The AuT encoder was pretrained on ~40M hours — roughly 60x Whisper's training data.

  2. Parakeet TDT is a non-autoregressive transducer — the TDT joint network maps encoder frames directly to tokens, with no autoregressive loop and no generative hallucination possible by design.
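The TDT decoding idea above can be sketched as a toy loop. This is a simplified illustration, not Parakeet's implementation: it assumes the joint outputs, a (token, duration) pair per frame, are already computed, and it shows how the duration head lets the decoder jump over frames instead of emitting token-by-token.

```python
BLANK = 0  # blank token id (assumption for this toy example)

def tdt_greedy_decode(joint_outputs):
    """joint_outputs[t] = (token_id, duration) precomputed for frame t."""
    t, tokens = 0, []
    while t < len(joint_outputs):
        token, duration = joint_outputs[t]
        if token != BLANK:
            tokens.append(token)
        t += max(1, duration)  # duration head lets the decoder skip ahead
    return tokens

# Hypothetical per-frame joint outputs: (token, duration)
frames = [(7, 2), (0, 1), (5, 1), (0, 1), (9, 1)]
print(tdt_greedy_decode(frames))  # → [7, 5, 9]
```

Note the decode is a single pass over encoder frames; there is no generated-token feedback loop of the kind that lets attention decoders hallucinate.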

Multilingual note: 4-bit quantization is catastrophic for non-English. Korean goes from 6.89% (8-bit) to 19.95% (4-bit) WER on FLEURS, nearly a threefold increase in errors. English barely changes. If you serve non-English users, don't use 4-bit.

All numbers reproducible — benchmark script in the repo, takes 15 minutes on M2 Max.

Article (architecture breakdown + full benchmarks): https://blog.ivan.digital/we-beat-whisper-large-v3-with-a-600m-model-running-entirely-on-your-mac-20e6ce191174

Library: github.com/soniqo/speech-swift

submitted by /u/ivan_digital