When AVSR Meets Video Conferencing: Dataset, Degradation, and the Hidden Mechanism Behind Performance Collapse
arXiv cs.CV / 3/25/2026
Key Points
- The paper provides the first systematic evaluation of state-of-the-art audio-visual speech recognition (AVSR) models on mainstream video conferencing (VC) platforms and finds severe performance degradation in real-world settings.
- It attributes the collapse primarily to transmission distortions and unexpected human hyper-expression, and introduces MLD-VC, a new VC-specific multimodal dataset recorded under the Lombard effect to better capture such behavior.
- The authors identify speech enhancement algorithms as the key driver of the distribution shift, notably through alteration of the first and second formants (F1, F2) of the audio.
- They show that the distribution shift induced by the Lombard effect closely matches the one caused by speech enhancement, explaining why AVSR models trained on Lombard data are more robust in VC settings.
- Fine-tuning AVSR models on MLD-VC reduces character error rate (CER) by an average of 17.5% across multiple VC platforms, and the dataset is released on Hugging Face.
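The formant-shift finding can be checked empirically: formants are the resonance peaks of the vocal tract, and a standard way to measure F1/F2 is linear predictive coding (LPC) root-finding. The sketch below is a hypothetical illustration of that measurement (it is not the paper's code, and the helper names `levinson` and `estimate_formants` are invented here): fit LPC coefficients to a speech frame, take the complex roots of the LPC polynomial, and read formant frequencies off the sharp (narrow-bandwidth) roots. Comparing these estimates before and after a platform's speech enhancement would expose the F1/F2 alteration the paper describes.

```python
import numpy as np

def levinson(r, order):
    """Levinson-Durbin recursion: autocorrelation -> LPC coefficients [1, a1, ..., ap]."""
    a = np.array([1.0])
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:], r[i - 1:0:-1])) / err
        a = np.concatenate([a, [0.0]])
        a = a + k * a[::-1]          # update all coefficients in place
        err *= (1.0 - k * k)
    return a

def estimate_formants(frame, sr, order=10):
    """Return candidate formant frequencies (Hz) for one speech frame via LPC roots."""
    # Pre-emphasis flattens the spectral tilt so resonances stand out.
    emphasized = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])
    windowed = emphasized * np.hamming(len(emphasized))
    # Autocorrelation method for the LPC fit.
    autocorr = np.correlate(windowed, windowed, mode="full")[len(windowed) - 1:]
    a = levinson(autocorr, order)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]                  # one root per conjugate pair
    freqs = np.angle(roots) * sr / (2.0 * np.pi)       # pole angle  -> frequency
    bandwidths = -np.log(np.abs(roots)) * sr / np.pi   # pole radius -> 3 dB bandwidth
    # Keep only sharp resonances above ~90 Hz; the two lowest are F1 and F2.
    return sorted(f for f, b in zip(freqs, bandwidths) if f > 90 and b < 400)
```

On a frame synthesized with known resonances (e.g. poles at 700 Hz and 1200 Hz), the two lowest returned frequencies land near those values, which is the kind of F1/F2 readout one would compare across raw and enhanced audio.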