Au-M-ol: A Unified Model for Medical Audio and Language Understanding

arXiv cs.CL / 4/28/2026


Key Points

  • Au-M-ol is a new multimodal architecture that extends Large Language Models (LLMs) with audio processing to better understand medical speech.
  • The model combines three components: an audio encoder that extracts clinical acoustic features, an adaptation layer that maps those features into the LLM input space, and a pretrained LLM that performs transcription and clinical language understanding (a minimal structural sketch follows this list).
  • Experiments on medical transcription tasks show a 56% reduction in Word Error Rate (WER) versus state-of-the-art baselines.
  • Au-M-ol also remains robust in challenging conditions, including noisy environments, domain-specific terminology, and speaker variability, indicating potential for reliable real-world clinical use.
  • Overall, the results position Au-M-ol as a strong candidate for clinical ASR and context-aware interpretation of spoken medical content.
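
The summary above names the three components but no reference code is published, so the following is a minimal, hypothetical PyTorch sketch of that layout. The sizes (`n_mels`, `enc_dim`, `llm_dim`), the Transformer encoder, and the single linear adapter are illustrative assumptions; the actual Au-M-ol implementation may differ.

```python
import torch
import torch.nn as nn

class AuMolSketch(nn.Module):
    """Hypothetical sketch of the three-component design (not the paper's code)."""

    def __init__(self, n_mels: int = 80, enc_dim: int = 512, llm_dim: int = 4096):
        super().__init__()
        # (1) Audio encoder: log-mel frames -> contextualized acoustic features.
        self.frontend = nn.Linear(n_mels, enc_dim)
        self.audio_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=enc_dim, nhead=8, batch_first=True),
            num_layers=6,
        )
        # (2) Adaptation layer: project acoustic features into the LLM input space.
        self.adapter = nn.Linear(enc_dim, llm_dim)
        # (3) A pretrained LLM (not instantiated here) would consume the projected
        #     audio embeddings as a prefix alongside its text token embeddings.

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, n_mels)
        x = self.frontend(mel)        # (batch, frames, enc_dim)
        x = self.audio_encoder(x)     # contextualized acoustic features
        return self.adapter(x)        # (batch, frames, llm_dim), ready for the LLM

# Usage: two clips of 200 mel frames each -> LLM-space prefix embeddings.
prefix = AuMolSketch()(torch.randn(2, 200, 80))
print(prefix.shape)  # torch.Size([2, 200, 4096])
```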

Abstract

In this work, we present Au-M-ol, a novel multimodal architecture that extends Large Language Models (LLMs) with audio processing. It is designed to improve performance on clinically relevant tasks such as Automatic Speech Recognition (ASR). Au-M-ol has three main components: (1) an audio encoder that extracts rich acoustic features from medical speech, (2) an adaptation layer that maps audio features into the LLM input space, and (3) a pretrained LLM that performs transcription and clinical language understanding. This design allows the model to interpret spoken medical content directly, improving both accuracy and robustness. In experiments, Au-M-ol reduces Word Error Rate (WER) by 56% compared to state-of-the-art baselines on medical transcription tasks. The model also performs well in challenging conditions, including noisy environments, domain-specific terminology, and speaker variability. These results suggest that Au-M-ol is a strong candidate for real-world clinical applications, where reliable and context-aware audio understanding is essential.
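
For concreteness, this is how WER and a relative reduction like the reported 56% are conventionally computed, assuming the reported figure is a relative (not absolute) reduction. The baseline and model WER values below are illustrative only, not figures from the paper.

```python
def wer(substitutions: int, deletions: int, insertions: int, n_ref_words: int) -> float:
    """Standard word error rate: (S + D + I) / N over the reference transcript."""
    return (substitutions + deletions + insertions) / n_ref_words

def relative_wer_reduction(baseline_wer: float, model_wer: float) -> float:
    """Relative reduction: (baseline - model) / baseline."""
    return (baseline_wer - model_wer) / baseline_wer

# Illustrative numbers only: a 56% relative reduction would take a
# hypothetical baseline WER of 10.0% down to 4.4%.
print(f"{relative_wer_reduction(0.10, 0.044):.0%}")  # 56%
```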