What small speech to text (STT) model is best at recognizing whispered speech?

Reddit r/LocalLLaMA / 5/20/2026

💬 Opinion | Ideas & Deep Analysis | Tools & Practical Usage


Key Points

The post asks which small speech-to-text (STT) model can best recognize whispered speech on a midrange phone.
The constraint is practical: the model must run on the phone itself rather than depend on heavy infrastructure.
The discussion also raises whether an existing STT model could be fine-tuned to improve performance for whispered audio.
The underlying goal is to find a workaround for situations where speaking to a phone is socially inappropriate.

Speaking to a phone is not appropriate in all social situations.

What STT model, runnable on a midrange phone, is good at recognizing whispered speech?

Could an existing STT model be fine-tuned to be better at recognizing whispered speech?

Thank you.
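
For anyone comparing candidate models on their own whispered recordings, word error rate (WER) is a common way to score them. Below is a minimal sketch, assuming you already have a reference transcript and each model's output as plain strings. The function name `word_error_rate` and the sample sentences are illustrative and not from the post, and real evaluations usually also normalize punctuation and numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # hypothesis is empty: delete all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (word missing in hypothesis)
                curr[j - 1] + 1,     # insertion (extra word in hypothesis)
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of four.
score = word_error_rate("turn on the light", "turn of the light")
print(score)
```

Scoring the same set of whispered clips with each model, before and after any fine-tuning, gives a like-for-like comparison. Lower is better, and values above 1.0 are possible when the output contains many extra words.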