TL;DR: I updated my medical speech-to-text benchmark to 42 models (up from 31 in v3) and added a new metric: Medical WER (M-WER). Standard WER treats every word equally. In medical audio, that makes little sense — “yeah” and “amoxicillin” do not carry the same importance. So for v4 I re-scored the benchmark using only clinically relevant words: drugs, conditions, symptoms, anatomy, and clinical procedures. I also broke out Drug M-WER separately, since medication names are where patient-safety risk gets real. That change reshuffled the leaderboard hard. A few notable results:
All code, transcripts, per-file metrics, and the full leaderboard are open-source on GitHub.

What changed since v3

1. New headline metric: Medical WER (M-WER)

Standard WER is still useful, but in a doctor-patient conversation it overweights the wrong things. A missed filler word and a missed medication name both count as one error, even though only one is likely to matter clinically. So for v4 I added:
The current vocabulary covers 179 terms across 5 categories:
The reshuffle is real. Parakeet TDT 0.6B v3 looked great on standard WER in v3, but on M-WER it falls to #31, with 22% Drug M-WER: great at conversational glue, much weaker on the words that actually carry clinical meaning.

2. 11 new models added (31 → 42)

This round added a bunch of serious new contenders:
Also added a separate multi-speaker track with Multitalker Parakeet 0.6B using cpWER, since joint ASR + diarization is a different evaluation problem.

Top 20 by Medical WER

Dataset: PriMock57 — 55 doctor-patient consultations, ~80K words of British English medical dialogue.
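For readers unfamiliar with cpWER: the idea is to concatenate each speaker's words into one stream per speaker, then score under the speaker mapping that minimizes total error, so diarization label-swaps are not punished as word errors. A minimal sketch (my own toy implementation, not the benchmark's code; speaker labels and padding behavior are illustrative assumptions):

```python
from itertools import permutations

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance with a rolling DP row."""
    n, m = len(ref), len(hyp)
    d = list(range(m + 1))
    for i in range(1, n + 1):
        prev, d[0] = d[0], i  # prev holds d[i-1][j-1]
        for j in range(1, m + 1):
            prev, d[j] = d[j], min(prev + (ref[i - 1] != hyp[j - 1]),
                                   d[j] + 1, d[j - 1] + 1)
    return d[m]

def cp_wer(ref_by_spk, hyp_by_spk):
    """ref_by_spk / hyp_by_spk: dict speaker_id -> list of words."""
    refs = list(ref_by_spk.values())
    hyps = list(hyp_by_spk.values())
    while len(hyps) < len(refs):   # pad if diarization found fewer speakers
        hyps.append([])
    total = sum(len(r) for r in refs)
    # best (minimum-error) assignment of hypothesis streams to references
    best = min(sum(edit_distance(r, h) for r, h in zip(refs, perm))
               for perm in permutations(hyps, len(refs)))
    return best / total
```

With a perfect transcript whose speaker labels are swapped, `cp_wer` returns 0.0, whereas naive per-speaker WER would charge every word as an error. The brute-force permutation is fine for two-party consultations; real implementations use an assignment solver for many speakers.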
Full 42-model leaderboard on GitHub.

The funny part: Microsoft vs Microsoft

Microsoft now has two visible STT offerings in this benchmark:
And on the metric that actually matters for medical voice, the open model wins clearly:
So Microsoft’s own open-source release beats Microsoft’s flagship closed STT product by:
VibeVoice is very good, but it is also heavy: 9B params, long inference, and we ran it on an H100 96GB. So it wins on contextual medical accuracy, but not on deployability.

Best small open-source model: Qwen3-ASR 1.7B

This is probably the most practically interesting open-source result in the whole board. Qwen3-ASR 1.7B lands at:
That is a strong accuracy-to-cost tradeoff. It is much faster than VibeVoice, much smaller, and still good enough on medical terms that I think a lot of people building local or semi-local clinical voice stacks will care more about this result than about the #1 spot.

One important deployment caveat: Qwen3-ASR does not play nicely with the T4. The model path wants newer attention support and ships in bf16, so an A10 or better is the realistic target. There was also a nasty long-audio bug in the default vLLM setup: Qwen3 would silently hang on longer files. The practical fix was a one-line change to the default vLLM setup, which fixed it for us; full notes are in the repo.

Cloud APIs got serious this round

v3 was still mostly a Google / ElevenLabs / OpenAI / Mistral story. v4 broadened that a lot:
Google still dominates the very top, but the broader takeaway is different: the gap between strong cloud APIs and strong open-source models is now small enough that deployment constraints matter more than ever.

How M-WER is computed

The implementation is simple on purpose:
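As a minimal sketch of the idea (my reading of the description above, not the repo's actual code — the vocabulary, tokenization, and error-counting choices here are illustrative assumptions):

```python
# Toy M-WER scorer: align hypothesis to reference with standard edit
# distance, then count only errors whose *reference* token is in the
# clinical vocabulary. Hypothesis-side insertions are ignored in this
# sketch, which is a simplification.
MEDICAL_VOCAB = {"amoxicillin", "infection", "hypertension", "nausea", "biopsy"}

def align(ref, hyp):
    """Levenshtein alignment returning (op, ref_tok, hyp_tok) triples."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),
                          d[i - 1][j] + 1, d[i][j - 1] + 1)
    ops, i, j = [], n, m           # backtrace the optimal alignment
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            ops.append(("ok" if ref[i - 1] == hyp[j - 1] else "sub",
                        ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    return ops[::-1]

def m_wer(ref, hyp, vocab=MEDICAL_VOCAB):
    """Clinical errors / clinical reference tokens."""
    errs = sum(1 for op, r, _ in align(ref, hyp)
               if op in ("sub", "del") and r in vocab)
    denom = sum(1 for t in ref if t in vocab)
    return errs / denom if denom else 0.0
```

The point of the filter: dropping a filler like "uh" moves standard WER but leaves M-WER at 0, while garbling "amoxicillin" costs a full clinical error against a much smaller denominator.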
Current vocab:
The vocabulary file is in the repo.

Links
Happy to take questions, criticism on the metric design, or suggestions for v5.
I benchmarked 42 STT models on medical audio with a new Medical WER metric — the leaderboard completely reshuffled
Reddit r/LocalLLaMA / 4/10/2026
Key Points
- Updates a medical speech-to-text benchmark from 31 to 42 models and introduces a clinically focused “Medical WER (M-WER)” plus a medication-only “Drug M-WER” to better reflect patient-safety relevance.
- Explains that standard WER overweights filler/low-importance words and treats all tokens equally, so the new metric re-scores only clinically relevant reference tokens.
- Reports that the leaderboard order changes substantially: VibeVoice-ASR 9B moves to #3 on M-WER, while Parakeet TDT 0.6B v3 drops to #31 due largely to weak drug-name performance.
- Highlights that small local models and cloud APIs both perform competitively under M-WER, with Qwen3-ASR 1.7B showing strong results and vendors like Soniox, AssemblyAI, and Deepgram ranking well.
- Notes that the code, transcripts, per-file metrics, and the full leaderboard are open-source on GitHub, enabling others to reproduce and extend the evaluation.