Why don't Automatic Speech Recognition models use prompting? [D]

Reddit r/MachineLearning / 4/25/2026


Key Points

  • The author proposes that adding prompting to automatic speech recognition (ASR) could improve real-world voice-agent performance, especially for tasks like recognizing specific entities (e.g., license plates or names).
  • They note limitations of current approaches such as word boosting in platforms like Deepgram, arguing it may not work well in practical applications.
  • The post suggests that providing conversation history as context to an ASR model could further help voice agents, and explores fine-tuning ASR with prompt-like text patterns.
  • Instead of boosting a long list of explicit words (which can be infeasible or limited by context windows), the author experiments with category-level prompts (e.g., “Boost words: [Australian cities, food names, TV shows]”).
  • The central question is why prompting (and similar conditioning) is not a common feature across ASR models today, given the apparent usefulness.

I've been working on the listening part of my full-duplex speech model and I realized that ASR prompting could be very useful.

Deepgram allows for word boosting, but that doesn't work well in real-world applications.

Another thing that's missing is feeding the whole conversation history as context to the ASR model. This could be very useful for voice agents.
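As a rough illustration of the idea, here is a minimal sketch of how conversation history could be packed into a text prompt for a prompt-conditioned ASR model. The `<text>…</text><|start|>` pattern follows the fine-tuning format described below; the function name, speaker labels, and character-budget truncation are illustrative assumptions, not part of any existing ASR API.

```python
# Hypothetical sketch: pack recent conversation turns into an ASR context
# prompt. The <text>...</text><|start|> wrapper follows the post's format;
# the budget and truncation strategy are assumptions for illustration.
def build_history_prompt(turns, max_chars=500):
    """Join the most recent conversation turns into a context prompt,
    dropping the oldest turns once the character budget is exceeded."""
    kept = []
    used = 0
    # Walk backwards so the newest turns are kept preferentially.
    for speaker, utterance in reversed(turns):
        line = f"{speaker}: {utterance}"
        if used + len(line) > max_chars:
            break
        kept.append(line)
        used += len(line)
    kept.reverse()  # restore chronological order
    return "<text>Conversation so far:\n" + "\n".join(kept) + "</text><|start|>"

turns = [
    ("agent", "Which city are you flying to?"),
    ("user", "Melbourne, next Tuesday."),
]
print(build_history_prompt(turns))
```

The intuition is that a decoder conditioned on this prefix should be biased toward entities already mentioned in the dialogue (e.g., "Melbourne") when transcribing the next utterance.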

TL;DR: during testing I realized the model can be fine-tuned for prompting with text like:

<text>Expect a license plate (3 letters, 3 numbers). For example ABC123.</text><|start|> 

or

<text>Expect a person's name. It could also contain a last name. For example John Doe.</text><|start|> 

Instead of listing every specific word to boost (which is sometimes not feasible, or would exhaust the context window), we can just specify a category of words and the model will know what to boost.

<text>Boost words: [Australian cities, food names, TV shows]</text><|start|> 
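To make the fine-tuning setup concrete, here is a minimal sketch of how prompt-conditioned training text could be assembled in this format. The helper name and the example transcripts are my own illustrative assumptions; in a real pipeline each string would be paired with its corresponding audio clip.

```python
# Hypothetical sketch: build prompt-conditioned training text for ASR
# fine-tuning, following the post's <text>...</text><|start|> pattern.
# The prompts and transcripts below are illustrative, not real data.
def make_training_text(prompt, transcript):
    """Prefix a transcript with its conditioning prompt."""
    return f"<text>{prompt}</text><|start|>{transcript}"

samples = [
    ("Expect a license plate (3 letters, 3 numbers). For example ABC123.",
     "the plate was XYZ842"),
    ("Boost words: [Australian cities, food names, TV shows]",
     "i just got back from wagga wagga"),
]
for prompt, transcript in samples:
    print(make_training_text(prompt, transcript))
```

During training, the loss would typically only be computed on the tokens after `<|start|>`, so the model learns to condition on the prompt rather than to generate it.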

I thought that by now this would surely be something most ASR models support, but it seems like none do.

Is there a reason why this is not a common feature?

Link to the full description:

https://ketsuilabs.io/blog/listen-head

submitted by /u/kwazar90