How do virtual assistants work? [D]

Reddit r/MachineLearning / 4/19/2026

💬 Opinion · Signals & Early Trends · Ideas & Deep Analysis · Models & Research

Key Points

  • The post asks how mainstream virtual assistants such as Siri, Alexa, Bixby, Cortana, and Google Assistant work under the hood.
  • The author has found limited and somewhat vague explanations, often expressed as high-level component “boxes” rather than concrete mechanisms.
  • They are specifically interested in how these systems worked before LLMs and modern chat-based AI agents, including architectures involving speech-to-text, tool calling, and text-to-speech.
  • The author suspects intent matching may be a key step (using classifiers and/or rule-based matching) and is asking whether that accounts for most of the functionality.
  • They request pointers to widely used literature that explains these systems historically and in practical terms.

How do virtual assistants like Siri, Alexa, Bixby, Cortana, and Google Assistant work? Searching for how Google Assistant and Siri work, I have found a few things, including this book on Google Books (found via Google Scholar): https://books.google.com/books?hl=en&lr=&id=H7daEAAAQBAJ&oi=fnd&pg=PP12&dq=info:OJRgUdIalvcJ:scholar.google.com/&ots=9luE8VnJh1&sig=RW40JMpgGsZgenYaI2GEsLfbGUk&redir_esc=y#v=onepage&q&f=false. Beyond the book, though, I have not been able to find out how they work. When I do find something, the diagrams and descriptions are quite vague and generalize a lot, for example by grouping components into boxes.

Or they are too specific to one niche. I want to see how these assistants worked before LLMs became popular. Today there are AI agents, like OpenClaw, where an LLM receives speech-to-text output, calls tools, and then responds through text-to-speech. I am looking for how it would have been done before ChatGPT was released.

I have found mentions of intent matching, which is probably text classification with a custom-trained classifier, plus rule-based matching (string matching with else-ifs or something similar), followed by calling "tools" based on the result. But I am wondering if that's really all there is to it.
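To make that guess concrete, here is a minimal sketch of rule-based intent matching followed by a tool call. It is only an illustration of the idea as described above, not how any real assistant is implemented. Every name in it (`set_timer`, `get_weather`, `RULES`, `TOOLS`, `match_intent`, `handle`) is hypothetical, and it assumes the input is text that a speech-to-text step has already produced. A trained classifier could stand in for `match_intent` without changing the dispatch step.

```python
import re
from typing import Callable, Dict, List, Optional, Pattern, Tuple

# Hypothetical "tools": plain functions the assistant can call.
def set_timer(minutes: str) -> str:
    return f"Timer set for {minutes} minutes."

def get_weather(city: str) -> str:
    # A real tool would query a weather service; this one only echoes the slot.
    return f"(would look up the weather for {city})"

# Rule-based intent matching: one regex per intent.
# Named groups play the role of "slots" (the arguments passed to the tool).
RULES: List[Tuple[str, Pattern[str]]] = [
    ("set_timer", re.compile(r"\bset (?:a )?timer for (?P<minutes>\d+) minutes?\b")),
    ("get_weather", re.compile(r"\bweather in (?P<city>[a-z ]+)")),
]

# Map each intent name to the tool that handles it.
TOOLS: Dict[str, Callable[..., str]] = {
    "set_timer": set_timer,
    "get_weather": get_weather,
}

def match_intent(text: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (intent, slots) for the first rule that matches, else None."""
    normalized = text.lower().strip()
    for intent, pattern in RULES:
        found = pattern.search(normalized)
        if found:
            slots = {name: value.strip() for name, value in found.groupdict().items()}
            return intent, slots
    return None

def handle(text: str) -> str:
    """Match an intent, call its tool, and return text for text-to-speech."""
    matched = match_intent(text)
    if matched is None:
        return "Sorry, I didn't understand that."
    intent, slots = matched
    return TOOLS[intent](**slots)

if __name__ == "__main__":
    print(handle("Set a timer for 10 minutes"))
    print(handle("What's the weather in Paris?"))
    print(handle("Tell me a joke"))
```

The returned string is what would be handed to text-to-speech. The rule list is checked in order, so the first matching pattern wins, which is the regex equivalent of the else-if chain mentioned above.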

If anyone can point me to any widely used literature I would appreciate it.

submitted by /u/SeyAssociation38