Why don't Automatic Speech Recognition models use prompting? [D]

Reddit r/MachineLearning / 4/25/2026


Key Points

  • The author proposes that adding prompting to automatic speech recognition (ASR) could improve real-world voice-agent performance, especially for tasks like recognizing specific entities (e.g., license plates or names).
  • They note limitations of current approaches such as word boosting in platforms like Deepgram, arguing it may not work well in practical applications.
  • The post suggests that providing conversation history as context to an ASR model could further help voice agents, and explores fine-tuning ASR with prompt-like text patterns.
  • Instead of boosting a long list of explicit words (which can be infeasible or limited by context windows), the author experiments with category-level prompts (e.g., “Boost words: [Australian cities, food names, TV shows]”).
  • The central question is why prompting (and similar conditioning) is not a common feature across ASR models today, given the apparent usefulness.

I've been working on the listening part of my full-duplex speech model and I realized that ASR prompting could be very useful.

Deepgram allows for word boosting, but that doesn't work well in real-world applications.

Another thing that's missing is feeding the whole conversation history as context to the ASR model. This could be very useful for voice agents.
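As a rough illustration of the idea, here is a minimal sketch of how conversation history could be packed into a text prompt for a prompt-conditioned ASR model. The `<text>…</text><|start|>` pattern follows the fine-tuning format described below; the function name, speaker labels, and character-budget truncation are illustrative assumptions, not part of any existing ASR API.

```python
# Hypothetical sketch: pack recent conversation turns into an ASR context
# prompt. The <text>...</text><|start|> wrapper follows the post's format;
# the budget and truncation strategy are assumptions for illustration.
def build_history_prompt(turns, max_chars=500):
    """Join the most recent conversation turns into a context prompt,
    dropping the oldest turns once the character budget is exceeded."""
    kept = []
    used = 0
    # Walk backwards so the newest turns are kept preferentially.
    for speaker, utterance in reversed(turns):
        line = f"{speaker}: {utterance}"
        if used + len(line) > max_chars:
            break
        kept.append(line)
        used += len(line)
    kept.reverse()  # restore chronological order
    return "<text>Conversation so far:\n" + "\n".join(kept) + "</text><|start|>"

turns = [
    ("agent", "Which city are you flying to?"),
    ("user", "Melbourne, next Tuesday."),
]
print(build_history_prompt(turns))
```

The intuition is that a decoder conditioned on this prefix should be biased toward entities already mentioned in the dialogue (e.g., "Melbourne") when transcribing the next utterance.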

TL;DR: during testing I realized the model can be fine-tuned for prompting with text like:

<text>Expect a license plate (3 letters, 3 numbers). For example ABC123.</text><|start|> 

or

<text>Expect a person's name. It could also contain a last name. For example John Doe.</text><|start|> 

Instead of listing every specific word to boost (which is sometimes not feasible, or would exhaust the context window), we can just specify a category of words and the model will know what to boost.

<text>Boost words: [Australian cities, food names, TV shows]</text><|start|> 
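To make the fine-tuning setup concrete, here is a minimal sketch of how prompt-conditioned training text could be assembled in this format. The helper name and the example transcripts are my own illustrative assumptions; in a real pipeline each string would be paired with its corresponding audio clip.

```python
# Hypothetical sketch: build prompt-conditioned training text for ASR
# fine-tuning, following the post's <text>...</text><|start|> pattern.
# The prompts and transcripts below are illustrative, not real data.
def make_training_text(prompt, transcript):
    """Prefix a transcript with its conditioning prompt."""
    return f"<text>{prompt}</text><|start|>{transcript}"

samples = [
    ("Expect a license plate (3 letters, 3 numbers). For example ABC123.",
     "the plate was XYZ842"),
    ("Boost words: [Australian cities, food names, TV shows]",
     "i just got back from wagga wagga"),
]
for prompt, transcript in samples:
    print(make_training_text(prompt, transcript))
```

During training, the loss would typically only be computed on the tokens after `<|start|>`, so the model learns to condition on the prompt rather than to generate it.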

I thought that by now this would surely be something most ASR models support, but it seems like none do.

Is there a reason why this is not a common feature?

Link to the full description:

https://ketsuilabs.io/blog/listen-head

submitted by /u/kwazar90