I benchmarked 42 STT models on medical audio with a new Medical WER metric — the leaderboard completely reshuffled

Reddit r/LocalLLaMA / 4/10/2026


Key Points

  • Updates a medical speech-to-text benchmark from 31 to 42 models and introduces a clinically focused “Medical WER (M-WER)” plus a medication-only “Drug M-WER” to better reflect patient-safety relevance.
  • Explains that standard WER overweights filler/low-importance words and treats all tokens equally, so the new metric re-scores only clinically relevant reference tokens.
  • Reports that the leaderboard order changes substantially: VibeVoice-ASR 9B moves to #3 on M-WER, while Parakeet TDT 0.6B v3 drops to #31 due largely to weak drug-name performance.
  • Highlights that small local models and cloud APIs both perform competitively under M-WER, with Qwen3-ASR 1.7B showing strong results and vendors like Soniox, AssemblyAI, and Deepgram ranking well.
  • Notes that the code, transcripts, per-file metrics, and the full leaderboard are open-source on GitHub, enabling others to reproduce and extend the evaluation.

TL;DR: I updated my medical speech-to-text benchmark to 42 models (up from 31 in v3) and added a new metric: Medical WER (M-WER).

Standard WER treats every word equally. In medical audio, that makes little sense — “yeah” and “amoxicillin” do not carry the same importance.

So for v4 I re-scored the benchmark using only clinically relevant words: drugs, conditions, symptoms, anatomy, and clinical procedures. I also broke out Drug M-WER separately, since medication names are where patient-safety risk gets real.

That change reshuffled the leaderboard hard.

A few notable results:

  • VibeVoice-ASR 9B ranks #3 on M-WER and beats Microsoft’s own new closed MAI-Transcribe-1, which lands at #11
  • Parakeet TDT 0.6B v3 drops from a strong overall-WER position to #31 on M-WER because of weak drug-name performance
  • Qwen3-ASR 1.7B is the most interesting small local model this round: 4.40% M-WER and about 7s/file on A10
  • Cloud APIs were stronger than I expected: Soniox, AssemblyAI Universal-3 Pro, and Deepgram Nova-3 Medical all ended up genuinely competitive

All code, transcripts, per-file metrics, and the full leaderboard are open-source on GitHub.

Previous posts: v1 · v2 · v3

What changed since v3

1. New headline metric: Medical WER (M-WER)

Standard WER is still useful, but in a doctor-patient conversation it overweights the wrong things. A missed filler word and a missed medication name both count as one error, even though only one is likely to matter clinically.

So for v4 I added:

  • M-WER = WER computed only over medically relevant reference tokens
  • Drug M-WER = same idea, but restricted to drug names only

The current vocabulary covers 179 terms across 5 categories:

  • drugs
  • conditions
  • symptoms
  • anatomy
  • clinical procedures

The reshuffle is real. Parakeet TDT 0.6B v3 looked great on normal WER in v3, but on M-WER it falls to #31, with 22% Drug M-WER. Great at conversational glue, much weaker on the words that actually carry clinical meaning.

2. 11 new models added (31 → 42)

This round added a bunch of new serious contenders:

  • Soniox stt-async-v4 → #4 on M-WER
  • AssemblyAI Universal-3 Pro (domain: medical-v1) → #7
  • Deepgram Nova-3 Medical → #9
  • Microsoft MAI-Transcribe-1 → #11
  • Qwen3-ASR 1.7B → #8, best small open-source model this round
  • Cohere Transcribe (Mar 2026) → #18, extremely fast
  • Parakeet TDT 1.1B → #15
  • Facebook MMS-1B-all → #42, dead last on this dataset

Also added a separate multi-speaker track with Multitalker Parakeet 0.6B using cpWER, since joint ASR + diarization is a different evaluation problem.

Top 20 by Medical WER

Dataset: PriMock57 — 55 doctor-patient consultations, ~80K words of British English medical dialogue.

| # | Model | WER | M-WER | Drug M-WER | Speed | Host |
|---|-------|-----|-------|------------|-------|------|
| 1 | Google Gemini 3 Pro Preview | 8.35% | 2.65% | 3.1% | 64.5s | API |
| 2 | Google Gemini 2.5 Pro | 8.15% | 2.97% | 4.1% | 56.4s | API |
| 3 | VibeVoice-ASR 9B (Microsoft, open-source) | 8.34% | 3.16% | 5.6% | 96.7s | H100 |
| 4 | Soniox stt-async-v4 | 9.18% | 3.32% | 7.1% | 46.2s | API |
| 5 | Google Gemini 3 Flash Preview | 11.33% | 3.64% | 5.2% | 51.5s | API |
| 6 | ElevenLabs Scribe v2 | 9.72% | 3.86% | 4.3% | 43.5s | API |
| 7 | AssemblyAI Universal-3 Pro (medical-v1) | 9.55% | 4.02% | 6.5% | 37.3s | API |
| 8 | Qwen3 ASR 1.7B (open-source) | 9.00% | 4.40% | 8.6% | 6.8s | A10 |
| 9 | Deepgram Nova-3 Medical | 9.05% | 4.53% | 9.7% | 12.9s | API |
| 10 | OpenAI GPT-4o Mini Transcribe (Dec '25) | 11.18% | 4.85% | 10.6% | 40.4s | API |
| 11 | Microsoft MAI-Transcribe-1 | 11.52% | 4.85% | 11.2% | 21.8s | API |
| 12 | ElevenLabs Scribe v1 | 10.87% | 4.88% | 7.5% | 36.3s | API |
| 13 | Google Gemini 2.5 Flash | 9.45% | 5.01% | 10.3% | 20.2s | API |
| 14 | Voxtral Mini Transcribe V1 | 11.85% | 5.17% | 11.0% | 22.4s | API |
| 15 | Parakeet TDT 1.1B | 9.03% | 5.20% | 15.5% | 12.3s | T4 |
| 16 | Voxtral Mini Transcribe V2 | 11.64% | 5.36% | 12.1% | 18.4s | API |
| 17 | Voxtral Mini 4B Realtime | 11.89% | 5.39% | 11.8% | 270.9s | A10 |
| 18 | Cohere Transcribe (Mar 2026) | 11.81% | 5.59% | 16.6% | 3.9s | A10 |
| 19 | OpenAI Whisper-1 | 13.20% | 5.62% | 10.3% | 104.3s | API |
| 20 | Groq Whisper Large v3 Turbo | 12.14% | 5.75% | 14.4% | 8.0s | API |

Full 42-model leaderboard on GitHub.

The funny part: Microsoft vs Microsoft

Microsoft now has two visible STT offerings in this benchmark:

  • VibeVoice-ASR 9B — open-source, from Microsoft Research
  • MAI-Transcribe-1 — closed, newly shipped by Microsoft's new SuperIntelligence team and available through Azure Foundry

And on the metric that actually matters for medical voice, the open model wins clearly:

  • VibeVoice-ASR 9B → #3, 3.16% M-WER
  • MAI-Transcribe-1 → #11, 4.85% M-WER

So Microsoft’s own open-source release beats Microsoft’s flagship closed STT product by:

  • 1.7 absolute points of M-WER
  • 5.6 absolute points of Drug M-WER

VibeVoice is very good, but it is also heavy: 9B params, long inference, and we ran it on H100 96GB. So it wins on contextual medical accuracy, but not on deployability.

Best small open-source model: Qwen3-ASR 1.7B

This is probably the most practically interesting open-source result in the whole board.

Qwen3-ASR 1.7B lands at:

  • 9.00% WER
  • 4.40% M-WER
  • 8.6% Drug M-WER
  • about 6.8s/file on A10

That is a strong accuracy-to-cost tradeoff.

It is much faster than VibeVoice, much smaller, and still good enough on medical terms that I think a lot of people building local or semi-local clinical voice stacks will care more about this result than the #1 spot.

One important deployment caveat: Qwen3-ASR does not play nicely with T4. The model path wants newer attention support and ships in bf16, so A10 or better is the realistic target.

There was also a nasty long-audio bug in the default vLLM setup: Qwen3 would silently hang on longer files. The practical fix was:

max_num_batched_tokens=16384 

That one-line change fixed it for us. Full notes are in the repo’s AGENTS.md.
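For context, here is roughly where that setting lives if you serve the model through vLLM's CLI. This is a config sketch, not the repo's actual invocation: the model ID and the other flags are illustrative, and only the batched-tokens setting is the fix described above.

```shell
# Hypothetical vLLM launch; --max-num-batched-tokens is the long-audio fix,
# everything else (model ID, dtype) is illustrative.
vllm serve Qwen/Qwen3-ASR-1.7B \
    --dtype bfloat16 \
    --max-num-batched-tokens 16384
```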

Cloud APIs got serious this round

v3 was still mostly a Google / ElevenLabs / OpenAI / Mistral story.

v4 broadened that a lot:

  • Soniox (#4) — impressive for a universal model without explicit medical specialization
  • AssemblyAI Universal-3 Pro (#7) — very solid, especially with medical-v1
  • Deepgram Nova-3 Medical (#9) — fastest serious cloud API in the top group
  • Microsoft MAI-Transcribe-1 (#11) — weaker than I expected, but still competitive

Google still dominates the very top, but the broader takeaway is different:

the gap between strong cloud APIs and strong open-source models is now small enough that deployment constraints matter more than ever.

How M-WER is computed

The implementation is simple on purpose:

  1. Tag medically relevant words in the reference transcript
  2. Run normal WER alignment between reference and hypothesis
  3. Count substitutions / deletions / insertions only on those tagged medical tokens
  4. Compute:
    • M-WER over all medical tokens
    • Drug M-WER over the drug subset only

Current vocab:

  • 179 medical terms
  • 5 categories
  • 464 drug-term occurrences in PriMock57

The vocabulary file is in evaluate/medical_terms_list.py and is easy to extend.
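The four steps above can be sketched in a few dozen lines. This is a minimal illustration of the idea, not the repo's actual implementation: `align` is a plain Levenshtein alignment with backtrace, and the toy vocabulary and transcripts are made up. Note that insertions do not attach to any reference token, so this sketch counts only substitutions and deletions on tagged tokens.

```python
def align(ref, hyp):
    # Levenshtein alignment with backtrace; yields (op, ref_idx, hyp_idx)
    # where op is "ok", "sub", "del", or "ins".
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost, d[i - 1][j] + 1, d[i][j - 1] + 1)
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            ops.append(("ok" if ref[i - 1] == hyp[j - 1] else "sub", i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("del", i - 1, None))
            i -= 1
        else:
            ops.append(("ins", None, j - 1))
            j -= 1
    return ops[::-1]

def medical_wer(ref, hyp, vocab):
    # Score errors only on reference tokens found in the medical vocabulary.
    ref_t, hyp_t = ref.lower().split(), hyp.lower().split()
    errors = relevant = 0
    for op, ri, _ in align(ref_t, hyp_t):
        if ri is not None and ref_t[ri] in vocab:  # tagged medical token
            relevant += 1
            if op != "ok":
                errors += 1
    return errors / relevant if relevant else 0.0

vocab = {"amoxicillin", "asthma", "inhaler"}  # toy vocabulary
ref = "yeah take amoxicillin twice daily for the asthma"
hyp = "yes take amoxicillin once daily for the asma"
print(medical_wer(ref, hyp, vocab))  # → 0.5
```

Here the hypothesis makes three word errors, but plain WER spreads them evenly while M-WER only sees two medical tokens ("amoxicillin", "asthma"), one of which is wrong: 1/2 = 50%. Restricting `vocab` to drug names gives the Drug M-WER variant.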

Links

Happy to take questions, criticism on the metric design, or suggestions for v5.

submitted by /u/MajesticAd2862