I benchmarked 31 STT models on medical audio — VibeVoice 9B is the new open-source leader at 8.34% WER, but it's big and slow

Reddit r/LocalLLaMA / 3/27/2026

Key Points

  • The updated medical speech-to-text benchmark covers 31 STT models and finds Microsoft VibeVoice-ASR 9B as the new open-source leader with 8.34% WER, close to Gemini 2.5 Pro at 8.15%.
  • VibeVoice-ASR 9B’s accuracy comes with steep deployment costs: it requires ~18GB VRAM (tested on an H100) and is slow at about 97 seconds per file versus ~6 seconds for faster baselines like Parakeet.
  • The benchmark adds five new evaluated models, including ElevenLabs Scribe v2 and several Voxtral/NVIDIA streaming-oriented options, each with different tradeoffs in accuracy and hardware suitability.
  • The biggest methodological change is fixing Whisper’s EnglishTextNormalizer bugs, where issues like treating “oh” as zero and missing common medical word equivalences inflated WER across models by an estimated 2–3%.
  • All benchmark code and results are published as open-source, enabling others to reproduce and compare STT performance under medical audio conditions.
TL;DR: v3 of my medical speech-to-text benchmark. 31 models now (up from 26 in v2). Microsoft VibeVoice-ASR 9B takes the open-source crown at 8.34% WER, nearly matching Gemini 2.5 Pro (8.15%). But it's 9B params, needs ~18GB VRAM (ran it on an H100 since I had easy access, but an L4 or similar would work too), and even on H100 it's slow — 97s per file vs 6s for Parakeet. Also found bugs in Whisper's text normalizer that were inflating WER by 2-3% across every model. All code + results are open-source.

Previous posts: v1 — 15 models | v2 — 26 models

What changed since v2

Five new models added (26 → 31):

  • Microsoft VibeVoice-ASR 9B — new open-source leader (8.34% WER), but needs ~18GB VRAM (won't fit on T4). I ran it on H100 since I had access, but an L4 or A10 would work too. Even on H100 it's slow at 97s/file.
  • ElevenLabs Scribe v2 — solid upgrade over v1 (9.72% vs 10.87%)
  • NVIDIA Nemotron Speech Streaming 0.6B — decent edge option at 11.06% on T4
  • Voxtral Mini 2602 via Transcription API (11.64%)
  • Voxtral Mini 4B via vLLM realtime (11.89% on H100, 693s on T4 — designed for streaming, not batch)

Also evaluated LiquidAI's LFM2.5-Audio-1.5B and Meta's SeamlessM4T v2 Large, but neither was suitable for this benchmark (more below in takeaways).

Replaced Whisper's normalizer with a custom one. This is the bigger deal. Found two bugs in Whisper's EnglishTextNormalizer that were quietly inflating WER:

  1. "oh" treated as zero — Whisper has self.zeros = {"o", "oh", "zero"}. In medical conversations, "oh" is always an interjection ("oh, my back hurts"), never the digit. This alone created thousands of false substitution errors.
  2. Missing word equivalences — ok/okay/k, yeah/yep/yes, mum/mom, alright/all right, kinda/kind of. Whisper doesn't normalize these to the same form, so every variant counted as an error.

Combined, these bugs inflated WER by ~2-3% across ALL models. Every score in v3 is recalculated with the custom normalizer. Code is in evaluate/text_normalizer.py — drop-in replacement, no whisper dependency needed.
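The two fixes can be sketched as follows. This is a minimal illustration of the approach described above, not the repo's actual `evaluate/text_normalizer.py`; the equivalence table here is just the examples from this post.

```python
import re

# Map spoken variants to one canonical form so they don't count as
# substitution errors against each other (fix #2 above).
EQUIVALENCES = {
    "ok": "okay", "k": "okay",
    "yeah": "yes", "yep": "yes",
    "mum": "mom",
    "alright": "all right",
    "kinda": "kind of",
}

def normalize(text: str) -> str:
    # Lowercase and keep only word characters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    # Fix #1: "oh" is deliberately NOT mapped to a digit here, unlike
    # Whisper's EnglishTextNormalizer (self.zeros = {"o", "oh", "zero"}).
    return " ".join(EQUIVALENCES.get(w, w) for w in words)

print(normalize("Oh, my back hurts. Yeah, it's OK."))
# -> "oh my back hurts yes it's okay"
```

With Whisper's normalizer, the same reference/hypothesis pair would disagree on "oh" (reference) vs "0" (normalized hypothesis), counting a substitution error on perfectly correct output.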

Top 15 Leaderboard

Dataset: PriMock57 — 55 doctor-patient consultations, ~80K words of British English medical dialogue.

| Rank | Model | WER | Speed (avg/file) | Runs on |
|------|-------|-----|------------------|---------|
| 1 | Gemini 2.5 Pro | 8.15% | 56s | API |
| 2 | VibeVoice-ASR 9B | 8.34% | 97s | H100 |
| 3 | Gemini 3 Pro Preview | 8.35% | 65s | API |
| 4 | Parakeet TDT 0.6B v3 | 9.35% | 6s | Apple Silicon |
| 5 | Gemini 2.5 Flash | 9.45% | 20s | API |
| 6 | ElevenLabs Scribe v2 | 9.72% | 44s | API |
| 7 | Parakeet TDT 0.6B v2 | 10.75% | 5s | Apple Silicon |
| 8 | ElevenLabs Scribe v1 | 10.87% | 36s | API |
| 9 | Nemotron Speech Streaming 0.6B | 11.06% | 12s | T4 |
| 10 | GPT-4o Mini (2025-12-15) | 11.18% | 40s | API |
| 11 | Kyutai STT 2.6B | 11.20% | 148s | GPU |
| 12 | Gemini 3 Flash Preview | 11.33% | 52s | API |
| 13 | Voxtral Mini 2602 (Transcription API) | 11.64% | 18s | API |
| 14 | MLX Whisper Large v3 Turbo | 11.65% | 13s | Apple Silicon |
| 15 | Mistral Voxtral Mini | 11.85% | 22s | API |

Full 31-model leaderboard (including the bottom half with Granite, Phi-4, MedASR etc.) on GitHub.

Key takeaways

VibeVoice is legit — but heavy and slow. At 9B params it's the first open-source model to genuinely compete with Gemini-tier cloud APIs on medical audio. Needs ~18GB VRAM (won't fit on T4, but doesn't need an H100 either — L4/A10 should work). Even on H100 though, 97s per file is slow compared to other local models.

Parakeet TDT 0.6B v3 is the real edge story. 9.35% WER at 6 seconds per file on Apple Silicon. A 0.6B model getting within 1% of a 9B model.

ElevenLabs Scribe v2 is a meaningful upgrade. 9.72% vs 10.87% for v1. Best cloud API option if you don't want to go Google.

LFM Audio and SeamlessM4T didn't make the cut. LFM2.5-Audio-1.5B isn't a dedicated ASR model — transcription is a secondary capability via prompting. With recommended 2s chunks: sparse keyword extractions (~74 words from a 1400-word conversation). With longer chunks: hallucination loops. SeamlessM4T is a translation model — it summarized the audio (~677 words from ~1400) instead of transcribing verbatim. Neither is suited for long-form transcription.

Normalizer PSA

If you're running WER benchmarks on conversational audio using Whisper's normalizer — your numbers are probably inflated. The "oh" bug alone affects any audio with natural speech. The custom normalizer is MIT licensed and has zero dependency on the whisper package. Grab it from the repo.
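To see concretely how a single normalizer bug inflates scores: WER is word-level Levenshtein distance divided by reference length, so one false substitution in a short utterance moves the number a lot. A minimal self-contained WER sketch (the benchmark repo may use a library such as jiwer instead):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# A correct transcript, scored after the "oh" -> "0" mis-normalization:
print(wer("oh my back hurts", "0 my back hurts"))  # -> 0.25
```

One phantom substitution turns a perfect four-word utterance into 25% WER; spread across thousands of interjections in conversational audio, that is where the 2-3% inflation comes from.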

submitted by /u/MajesticAd2862