**TL;DR:** v3 of my medical speech-to-text benchmark, now covering 31 models (up from 26 in v2). Microsoft VibeVoice-ASR 9B takes the open-source crown at 8.34% WER, nearly matching Gemini 2.5 Pro (8.15%). But it's 9B params, needs ~18GB VRAM (I ran it on an H100 since I had easy access, but an L4 or similar would work too), and even on an H100 it's slow: 97s per file vs 6s for Parakeet. I also found bugs in Whisper's text normalizer that were inflating WER by 2-3% across every model. All code and results are open-source.

Previous posts: v1 (15 models) | v2 (26 models)

**What changed since v2**

5 new models added (26 → 31).
Also evaluated LiquidAI's LFM2.5-Audio-1.5B and Meta's SeamlessM4T v2 Large, but neither was suitable for this benchmark (more below in the takeaways).

Replaced Whisper's normalizer with a custom one. This is the bigger deal. I found two bugs in Whisper's EnglishTextNormalizer: it treats the interjection "oh" as the digit zero, and it's missing equivalences for common medical words.
Combined, these bugs inflated WER by ~2-3% across ALL models. Every score in v3 is recalculated with the custom normalizer. The code is in the repo.

**Top 15 Leaderboard**

Dataset: PriMock57, 55 doctor-patient consultations, ~80K words of British English medical dialogue.
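For context, the metric throughout is word error rate (WER): word-level edit distance between hypothesis and reference, divided by reference length. A minimal sketch of the standard computation follows; this is not the benchmark's actual code, which presumably normalizes both sides first:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edits needed to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + sub,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the patient has hypertension", "the patient had hypertension"))  # 0.25
```

One substitution over a four-word reference gives 0.25, i.e. 25% WER; the leaderboard numbers are this same quantity averaged over the dataset.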
Full 31-model leaderboard (including the bottom half with Granite, Phi-4, MedASR etc.) is on GitHub.

**Key takeaways**

- VibeVoice is legit, but heavy and slow. At 9B params it's the first open-source model to genuinely compete with Gemini-tier cloud APIs on medical audio. Needs ~18GB VRAM (won't fit on a T4, but doesn't need an H100 either; an L4/A10 should work). Even on an H100, though, 97s per file is slow compared to other local models.
- Parakeet TDT 0.6B v3 is the real edge story. 9.35% WER at 6 seconds per file on Apple Silicon. A 0.6B model getting within ~1% of a 9B model.
- ElevenLabs Scribe v2 is a meaningful upgrade: 9.72% vs 10.87% for v1. Best cloud API option if you don't want to go Google.
- LFM Audio and SeamlessM4T didn't make the cut. LFM2.5-Audio-1.5B isn't a dedicated ASR model; transcription is a secondary capability via prompting. With the recommended 2s chunks it produced sparse keyword extractions (~74 words from a 1,400-word conversation); with longer chunks, hallucination loops. SeamlessM4T is a translation model and summarized the audio (~677 words from ~1,400) instead of transcribing verbatim. Neither is suited for long-form transcription.

**Normalizer PSA**

If you're running WER benchmarks on conversational audio using Whisper's normalizer, your numbers are probably inflated. The "oh" bug alone affects any audio with natural speech. The custom normalizer is MIT licensed and has zero dependency on the whisper package. Grab it from the repo.

Links:
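To illustrate the bug class behind the normalizer PSA above, here is a toy sketch (my own function names, not the repo's implementation): a normalizer that rewrites the interjection "oh" as the digit 0 corrupts natural speech before scoring, while treating standalone fillers as fillers avoids the penalty. Whether fillers should be dropped or kept is a design choice; the point is they shouldn't become numbers.

```python
import re

# Standalone filler words in conversational speech (illustrative list).
FILLERS = {"oh", "um", "uh", "hmm"}

def buggy_normalize(text: str) -> str:
    """Mimics the bug class: the interjection "oh" becomes the digit 0."""
    words = re.sub(r"[^\w\s]", "", text.lower()).split()
    return " ".join("0" if w == "oh" else w for w in words)

def fixed_normalize(text: str) -> str:
    """Drops standalone fillers instead of rewriting them as numbers."""
    words = re.sub(r"[^\w\s]", "", text.lower()).split()
    return " ".join(w for w in words if w not in FILLERS)

print(buggy_normalize("Oh, I see."))  # prints "0 i see"
print(fixed_normalize("Oh, I see."))  # prints "i see"
```

With the buggy version, every conversational "oh" in a hypothesis or reference turns into a spurious number mismatch, which is how a normalizer bug alone can shift WER by whole percentage points across all models.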
I benchmarked 31 STT models on medical audio — VibeVoice 9B is the new open-source leader at 8.34% WER, but it's big and slow
Reddit r/LocalLLaMA / 3/27/2026
💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research
Key Points
- The updated medical speech-to-text benchmark covers 31 STT models and finds Microsoft VibeVoice-ASR 9B as the new open-source leader with 8.34% WER, close to Gemini 2.5 Pro at 8.15%.
- VibeVoice-ASR 9B’s accuracy comes with steep deployment costs: it requires ~18GB VRAM (tested on an H100) and is slow at about 97 seconds per file versus ~6 seconds for faster baselines like Parakeet.
- The benchmark adds five new evaluated models, including ElevenLabs Scribe v2 and several Voxtral/NVIDIA streaming-oriented options, each with different tradeoffs in accuracy and hardware suitability.
- The biggest methodological change is fixing Whisper’s EnglishTextNormalizer bugs, where issues like treating “oh” as zero and missing common medical word equivalences inflated WER across models by an estimated 2–3%.
- All benchmark code and results are published as open-source, enabling others to reproduce and compare STT performance under medical audio conditions.