I benchmarked 42 STT models on medical audio with a new Medical WER metric — the leaderboard completely reshuffled

Reddit r/LocalLLaMA / 4/10/2026


Key Points

  • Updates a medical speech-to-text benchmark from 31 to 42 models and introduces a clinically focused “Medical WER (M-WER)” plus a medication-only “Drug M-WER” to better reflect patient-safety relevance.
  • Explains that standard WER overweights filler/low-importance words and treats all tokens equally, so the new metric re-scores only clinically relevant reference tokens.
  • Reports that the leaderboard order changes substantially: VibeVoice-ASR 9B moves to #3 on M-WER, while Parakeet TDT 0.6B v3 drops to #31 due largely to weak drug-name performance.
  • Highlights that small local models and cloud APIs both perform competitively under M-WER, with Qwen3-ASR 1.7B showing strong results and vendors like Soniox, AssemblyAI, and Deepgram ranking well.
  • Notes that the code, transcripts, per-file metrics, and the full leaderboard are open-source on GitHub, enabling others to reproduce and extend the evaluation.

TL;DR: I updated my medical speech-to-text benchmark to 42 models (up from 31 in v3) and added a new metric: Medical WER (M-WER).

Standard WER treats every word equally. In medical audio, that makes little sense — “yeah” and “amoxicillin” do not carry the same importance.

So for v4 I re-scored the benchmark using only clinically relevant words: drugs, conditions, symptoms, anatomy, and clinical procedures. I also broke out Drug M-WER separately, since medication names are where patient-safety risk gets real.

That change reshuffled the leaderboard hard.

A few notable results:

  • VibeVoice-ASR 9B ranks #3 on M-WER and beats Microsoft’s own new closed MAI-Transcribe-1, which lands at #11
  • Parakeet TDT 0.6B v3 drops from a strong overall-WER position to #31 on M-WER because of weak drug-name performance
  • Qwen3-ASR 1.7B is the most interesting small local model this round: 4.40% M-WER and about 7s/file on A10
  • Cloud APIs were stronger than I expected: Soniox, AssemblyAI Universal-3 Pro, and Deepgram Nova-3 Medical all ended up genuinely competitive

All code, transcripts, per-file metrics, and the full leaderboard are open-source on GitHub.

Previous posts: v1 · v2 · v3

What changed since v3

1. New headline metric: Medical WER (M-WER)

Standard WER is still useful, but in a doctor-patient conversation it overweights the wrong things. A missed filler word and a missed medication name both count as one error, even though only one is likely to matter clinically.

So for v4 I added:

  • M-WER = WER computed only over medically relevant reference tokens
  • Drug M-WER = same idea, but restricted to drug names only

The current vocabulary covers 179 terms across 5 categories:

  • drugs
  • conditions
  • symptoms
  • anatomy
  • clinical procedures

The reshuffle is real. Parakeet TDT 0.6B v3 looked great on normal WER in v3, but on M-WER it falls to #31, with 22% Drug M-WER. Great at conversational glue, much weaker on the words that actually carry clinical meaning.

2. 11 new models added (31 → 42)

This round added a bunch of new serious contenders:

  • Soniox stt-async-v4 → #4 on M-WER
  • AssemblyAI Universal-3 Pro (domain: medical-v1) → #7
  • Deepgram Nova-3 Medical → #9
  • Microsoft MAI-Transcribe-1 → #11
  • Qwen3-ASR 1.7B → #8, best small open-source model this round
  • Cohere Transcribe (Mar 2026) → #18, extremely fast
  • Parakeet TDT 1.1B → #15
  • Facebook MMS-1B-all → #42, dead last on this dataset

Also added a separate multi-speaker track with Multitalker Parakeet 0.6B using cpWER, since joint ASR + diarization is a different evaluation problem.

Top 20 by Medical WER

Dataset: PriMock57 — 55 doctor-patient consultations, ~80K words of British English medical dialogue.

| # | Model | WER | M-WER | Drug M-WER | Speed | Host |
|---|-------|-----|-------|------------|-------|------|
| 1 | Google Gemini 3 Pro Preview | 8.35% | 2.65% | 3.1% | 64.5s | API |
| 2 | Google Gemini 2.5 Pro | 8.15% | 2.97% | 4.1% | 56.4s | API |
| 3 | VibeVoice-ASR 9B (Microsoft, open-source) | 8.34% | 3.16% | 5.6% | 96.7s | H100 |
| 4 | Soniox stt-async-v4 | 9.18% | 3.32% | 7.1% | 46.2s | API |
| 5 | Google Gemini 3 Flash Preview | 11.33% | 3.64% | 5.2% | 51.5s | API |
| 6 | ElevenLabs Scribe v2 | 9.72% | 3.86% | 4.3% | 43.5s | API |
| 7 | AssemblyAI Universal-3 Pro (medical-v1) | 9.55% | 4.02% | 6.5% | 37.3s | API |
| 8 | Qwen3 ASR 1.7B (open-source) | 9.00% | 4.40% | 8.6% | 6.8s | A10 |
| 9 | Deepgram Nova-3 Medical | 9.05% | 4.53% | 9.7% | 12.9s | API |
| 10 | OpenAI GPT-4o Mini Transcribe (Dec '25) | 11.18% | 4.85% | 10.6% | 40.4s | API |
| 11 | Microsoft MAI-Transcribe-1 | 11.52% | 4.85% | 11.2% | 21.8s | API |
| 12 | ElevenLabs Scribe v1 | 10.87% | 4.88% | 7.5% | 36.3s | API |
| 13 | Google Gemini 2.5 Flash | 9.45% | 5.01% | 10.3% | 20.2s | API |
| 14 | Voxtral Mini Transcribe V1 | 11.85% | 5.17% | 11.0% | 22.4s | API |
| 15 | Parakeet TDT 1.1B | 9.03% | 5.20% | 15.5% | 12.3s | T4 |
| 16 | Voxtral Mini Transcribe V2 | 11.64% | 5.36% | 12.1% | 18.4s | API |
| 17 | Voxtral Mini 4B Realtime | 11.89% | 5.39% | 11.8% | 270.9s | A10 |
| 18 | Cohere Transcribe (Mar 2026) | 11.81% | 5.59% | 16.6% | 3.9s | A10 |
| 19 | OpenAI Whisper-1 | 13.20% | 5.62% | 10.3% | 104.3s | API |
| 20 | Groq Whisper Large v3 Turbo | 12.14% | 5.75% | 14.4% | 8.0s | API |

Full 42-model leaderboard on GitHub.

The funny part: Microsoft vs Microsoft

Microsoft now has two visible STT offerings in this benchmark:

  • VibeVoice-ASR 9B — open-source, from Microsoft Research
  • MAI-Transcribe-1 — closed, newly shipped by Microsoft's new SuperIntelligence team and available through Azure Foundry

And on the metric that actually matters for medical voice, the open model wins clearly:

  • VibeVoice-ASR 9B → #3, 3.16% M-WER
  • MAI-Transcribe-1 → #11, 4.85% M-WER

So Microsoft’s own open-source release beats Microsoft’s flagship closed STT product by:

  • 1.7 absolute points of M-WER
  • 5.6 absolute points of Drug M-WER

VibeVoice is very good, but it is also heavy: 9B params, long inference, and we ran it on H100 96GB. So it wins on contextual medical accuracy, but not on deployability.

Best small open-source model: Qwen3-ASR 1.7B

This is probably the most practically interesting open-source result in the whole board.

Qwen3-ASR 1.7B lands at:

  • 9.00% WER
  • 4.40% M-WER
  • 8.6% Drug M-WER
  • about 6.8s/file on A10

That is a strong accuracy-to-cost tradeoff.

It is much faster than VibeVoice, much smaller, and still good enough on medical terms that I think a lot of people building local or semi-local clinical voice stacks will care more about this result than the #1 spot.

One important deployment caveat: Qwen3-ASR does not play nicely with T4. The model path wants newer attention support and ships in bf16, so A10 or better is the realistic target.

There was also a nasty long-audio bug in the default vLLM setup: Qwen3 would silently hang on longer files. The practical fix was:

max_num_batched_tokens=16384 

That one-line change fixed it for us. Full notes are in the repo’s AGENTS.md.
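For context, here is roughly where that setting lives if you serve the model through vLLM's CLI. This is a config sketch, not the repo's actual invocation: the model ID and the other flags are illustrative, and only the batched-tokens setting is the fix described above.

```shell
# Hypothetical vLLM launch; --max-num-batched-tokens is the long-audio fix,
# everything else (model ID, dtype) is illustrative.
vllm serve Qwen/Qwen3-ASR-1.7B \
    --dtype bfloat16 \
    --max-num-batched-tokens 16384
```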

Cloud APIs got serious this round

v3 was still mostly a Google / ElevenLabs / OpenAI / Mistral story.

v4 broadened that a lot:

  • Soniox (#4) — impressive for a universal model without explicit medical specialization
  • AssemblyAI Universal-3 Pro (#7) — very solid, especially with medical-v1
  • Deepgram Nova-3 Medical (#9) — fastest serious cloud API in the top group
  • Microsoft MAI-Transcribe-1 (#11) — weaker than I expected, but still competitive

Google still dominates the very top, but the broader takeaway is different:

the gap between strong cloud APIs and strong open-source models is now small enough that deployment constraints matter more than ever.

How M-WER is computed

The implementation is simple on purpose:

  1. Tag medically relevant words in the reference transcript
  2. Run normal WER alignment between reference and hypothesis
  3. Count substitutions / deletions / insertions only on those tagged medical tokens
  4. Compute:
    • M-WER over all medical tokens
    • Drug M-WER over the drug subset only

Current vocab:

  • 179 medical terms
  • 5 categories
  • 464 drug-term occurrences in PriMock57

The vocabulary file is in evaluate/medical_terms_list.py and is easy to extend.
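The four steps above can be sketched in a few dozen lines. This is a minimal illustration of the idea, not the repo's actual implementation: `align` is a plain Levenshtein alignment with backtrace, and the toy vocabulary and transcripts are made up. Note that insertions do not attach to any reference token, so this sketch counts only substitutions and deletions on tagged tokens.

```python
def align(ref, hyp):
    # Levenshtein alignment with backtrace; yields (op, ref_idx, hyp_idx)
    # where op is "ok", "sub", "del", or "ins".
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost, d[i - 1][j] + 1, d[i][j - 1] + 1)
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            ops.append(("ok" if ref[i - 1] == hyp[j - 1] else "sub", i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("del", i - 1, None))
            i -= 1
        else:
            ops.append(("ins", None, j - 1))
            j -= 1
    return ops[::-1]

def medical_wer(ref, hyp, vocab):
    # Score errors only on reference tokens found in the medical vocabulary.
    ref_t, hyp_t = ref.lower().split(), hyp.lower().split()
    errors = relevant = 0
    for op, ri, _ in align(ref_t, hyp_t):
        if ri is not None and ref_t[ri] in vocab:  # tagged medical token
            relevant += 1
            if op != "ok":
                errors += 1
    return errors / relevant if relevant else 0.0

vocab = {"amoxicillin", "asthma", "inhaler"}  # toy vocabulary
ref = "yeah take amoxicillin twice daily for the asthma"
hyp = "yes take amoxicillin once daily for the asma"
print(medical_wer(ref, hyp, vocab))  # → 0.5
```

Here the hypothesis makes three word errors, but plain WER spreads them evenly while M-WER only sees two medical tokens ("amoxicillin", "asthma"), one of which is wrong: 1/2 = 50%. Restricting `vocab` to drug names gives the Drug M-WER variant.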

Links

Happy to take questions, criticism on the metric design, or suggestions for v5.

submitted by /u/MajesticAd2862