When AVSR Meets Video Conferencing: Dataset, Degradation, and the Hidden Mechanism Behind Performance Collapse

arXiv cs.CV / 3/25/2026


Key Points

  • The paper provides the first systematic evaluation of state-of-the-art audio-visual speech recognition (AVSR) models on mainstream video conferencing (VC) platforms and finds severe performance degradation in real-world settings.
  • It attributes the collapse primarily to transmission distortions and unexpected human hyper-expression, and introduces MLD-VC, a new VC-specific multimodal dataset that explicitly elicits the Lombard effect to capture such behavior.
  • The authors identify that speech enhancement algorithms are the key driver of distribution shift, notably altering the first and second formants of audio.
  • They show that the distribution shift created by the Lombard effect closely matches that from speech enhancement, explaining why AVSR models trained with Lombard data are more robust in VC.
  • Fine-tuning AVSR models on MLD-VC reduces character error rate (CER) by an average of 17.5% across multiple VC platforms, and the dataset is released on Hugging Face.
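The key points quote an average 17.5% reduction in character error rate (CER). For readers unfamiliar with the metric, CER is the character-level Levenshtein (edit) distance between the recognized transcript and the reference, normalized by the reference length. A minimal, illustrative implementation (not code from the paper):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level Levenshtein distance
    (substitutions + insertions + deletions) / reference length."""
    r, h = list(reference), list(hypothesis)
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub,             # substitution / match
                          d[i - 1][j] + 1,  # deletion
                          d[i][j - 1] + 1)  # insertion
    return d[len(r)][len(h)] / max(len(r), 1)

# One substitution ('a' -> 'e') in a 10-character reference: CER = 0.1
print(cer("video call", "video cell"))
```

Note that the 17.5% figure in the paper is an average across VC platforms; this sketch only shows how the underlying metric is computed.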

Abstract

Audio-Visual Speech Recognition (AVSR) has achieved remarkable progress in offline conditions, yet its robustness in real-world video conferencing (VC) remains largely unexplored. This paper presents the first systematic evaluation of state-of-the-art AVSR models across mainstream VC platforms, revealing severe performance degradation caused by transmission distortions and spontaneous human hyper-expression. To address this gap, we construct \textbf{MLD-VC}, the first multimodal dataset tailored for VC, comprising 31 speakers, 22.79 hours of audio-visual data, and explicit use of the Lombard effect to enhance human hyper-expression. Through comprehensive analysis, we find that speech enhancement algorithms are the primary source of distribution shift, which alters the first and second formants of audio. Interestingly, we find that the distribution shift induced by the Lombard effect closely resembles that introduced by speech enhancement, which explains why models trained on Lombard data exhibit greater robustness in VC. Fine-tuning AVSR models on MLD-VC mitigates this issue, achieving an average 17.5% reduction in CER across several VC platforms. Our findings and dataset provide a foundation for developing more robust and generalizable AVSR systems in real-world video conferencing. MLD-VC is available at https://huggingface.co/datasets/nccm2p2/MLD-VC.