Enhancing ASR Performance in the Medical Domain for Dravidian Languages

arXiv cs.CL / 4/23/2026


Key Points

  • The paper addresses low-resource medical-domain ASR for Dravidian languages such as Telugu and Kannada, where limited annotated data and morphological complexity hinder performance.
  • It introduces a confidence-aware training framework that fuses real and synthetic (TTS) speech using a hybrid confidence signal combining static perceptual/acoustic similarity metrics with dynamic model entropy.
  • Instead of straightforward fine-tuning, the method uses fixed-weight and learnable-weight confidence aggregation to weight training samples drawn from heterogeneous data sources.
  • Experiments on medical datasets with both real recordings and TTS-generated audio show large gains, with Telugu WER improving from 24.3% to 15.8% and Kannada WER from 31.7% to 25.4%.
  • Post-decoding correction is performed with a 5-gram KenLM language model, and the proposed hybrid approach outperforms standard fine-tuning baselines while improving recognition accuracy in this specialized domain.
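The confidence mechanism in the key points above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's exact formulation: the entropy normalization, the fixed mixing weight `alpha`, and the assumption that the static perceptual/acoustic similarity score is pre-scaled to [0, 1] are all assumptions made here.

```python
import math

def entropy_confidence(token_posteriors):
    """Dynamic confidence: 1 minus the mean normalized entropy of the
    model's per-token posterior distributions (peaked -> confident)."""
    ents = []
    for dist in token_posteriors:
        h = -sum(p * math.log(p) for p in dist if p > 0.0)
        ents.append(h / math.log(len(dist)))  # normalize entropy to [0, 1]
    return 1.0 - sum(ents) / len(ents)

def hybrid_confidence(static_score, token_posteriors, alpha=0.5):
    """Fixed-weight aggregation of a static similarity score (assumed
    in [0, 1]) and the dynamic entropy signal; a learnable-weight
    variant would treat alpha as a trainable parameter instead."""
    return alpha * static_score + (1.0 - alpha) * entropy_confidence(token_posteriors)

def confidence_weighted_loss(per_sample_losses, confidences):
    """Confidence-aware objective: scale each sample's loss by its
    hybrid confidence, so low-confidence (e.g. noisy TTS) samples
    contribute less to the batch average."""
    total = sum(c * l for c, l in zip(confidences, per_sample_losses))
    return total / sum(confidences)
```

In this sketch, a synthetic utterance whose posteriors are nearly uniform receives a dynamic confidence close to zero and is down-weighted, while a real recording with peaked posteriors keeps almost its full loss weight.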

Abstract

Automatic Speech Recognition (ASR) for low-resource Dravidian languages like Telugu and Kannada faces significant challenges in specialized medical domains due to limited annotated data and morphological complexity. This work proposes a novel confidence-aware training framework that integrates real and synthetic speech data through a hybrid confidence mechanism combining static perceptual and acoustic similarity metrics with dynamic model entropy. Unlike direct fine-tuning approaches, the proposed methodology employs both fixed-weight and learnable-weight confidence aggregation strategies to guide sample weighting during training, enabling effective utilization of heterogeneous data sources. The framework is evaluated on Telugu and Kannada medical datasets containing both real recordings and TTS-generated synthetic speech. A 5-gram KenLM language model is applied for post-decoding correction. Results show that the hybrid confidence-aware approach with learnable weights substantially reduces recognition errors: Telugu Word Error Rate (WER) decreases from 24.3% to 15.8% (8.5% absolute improvement), while Kannada WER drops from 31.7% to 25.4% (6.3% absolute improvement), both significantly outperforming standard fine-tuning baselines. These findings confirm that combining adaptive confidence-aware training with statistical language modeling delivers superior performance for domain-specific ASR in morphologically complex Dravidian languages.
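The abstract's post-decoding step applies a 5-gram KenLM model; a common way to use such a model is N-best rescoring, where each decoder hypothesis is re-ranked by a weighted sum of its acoustic score and its language-model score. The sketch below illustrates that scheme with a toy word-level scorer standing in for `kenlm.Model.score`; the interpolation weight `lm_weight`, the toy vocabulary, and the example hypotheses are assumptions for illustration, not values from the paper.

```python
def rescore_nbest(nbest, lm_score, lm_weight=0.5):
    """Return the hypothesis maximizing a weighted sum of the ASR
    decoder's log-probability and a language-model log-probability.

    nbest: list of (text, asr_logprob) pairs.
    lm_score: callable mapping a sentence to a log-probability
              (in practice, kenlm.Model('medical.arpa').score).
    """
    return max(nbest, key=lambda h: h[1] + lm_weight * lm_score(h[0]))

# Toy unigram scorer standing in for a 5-gram KenLM model: it strongly
# prefers the in-domain medical word "dose" over the homophone "doze".
TOY_LOGPROBS = {"dose": -1.0, "doze": -8.0}

def toy_lm(sentence):
    return sum(TOY_LOGPROBS.get(w, -4.0) for w in sentence.split())
```

With `nbest = [("take one doze daily", -3.0), ("take one dose daily", -3.5)]`, rescoring overturns the acoustically preferred "doze" in favor of "dose", which is the kind of domain-specific correction the statistical language model contributes here.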