Demographic and Linguistic Bias Evaluation in Omnimodal Language Models

arXiv cs.CV · April 14, 2026


Key Points

  • The paper evaluates demographic and linguistic bias in omnimodal language models that jointly process text, images, audio, and video, focusing on performance gaps across demographic groups and languages.
  • It tests four omnimodal models on tasks including demographic attribute estimation, identity verification, activity recognition, multilingual speech transcription, and language identification.
  • Results indicate that image and video understanding tasks show smaller demographic disparities, while audio understanding exhibits much lower accuracy and substantial bias.
  • The study finds significant bias in audio tasks across age, gender, skin tone, and language, including cases of prediction collapse toward narrow categories.
  • The authors argue that fairness evaluation must cover all modalities supported by omnimodal models as these systems are increasingly deployed in real-world applications.

Abstract

This paper provides a comprehensive evaluation of demographic and linguistic biases in omnimodal language models that process text, images, audio, and video within a single framework. Although these models are being widely deployed, their performance across different demographic groups and modalities is not well studied. Four omnimodal models are evaluated on tasks that include demographic attribute estimation, identity verification, activity recognition, multilingual speech transcription, and language identification. Accuracy differences are measured across age, gender, skin tone, language, and country of origin. The results show that image and video understanding tasks generally exhibit better performance with smaller demographic disparities. In contrast, audio understanding tasks exhibit significantly lower performance and substantial bias, including large accuracy differences across age groups, genders, and languages, and frequent prediction collapse toward narrow categories. These findings highlight the importance of evaluating fairness across all supported modalities as omnimodal language models are increasingly used in real-world applications.
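The evaluation protocol the abstract describes boils down to measuring per-group accuracy gaps and detecting prediction collapse toward a narrow set of categories. A minimal sketch of such a fairness audit is below; the function names, record format, and the 0.8 collapse threshold are illustrative assumptions, not metrics taken from the paper.

```python
from collections import Counter

def group_accuracies(records):
    """Per-group accuracy from (group, prediction, label) records.

    `group` could be an age bracket, gender, skin-tone bin, or language,
    mirroring the demographic axes the paper evaluates.
    """
    correct, total = Counter(), Counter()
    for group, pred, label in records:
        total[group] += 1
        if pred == label:
            correct[group] += 1
    return {g: correct[g] / total[g] for g in total}

def accuracy_gap(acc_by_group):
    """Max minus min per-group accuracy: one simple disparity measure."""
    vals = list(acc_by_group.values())
    return max(vals) - min(vals)

def prediction_collapse(records, threshold=0.8):
    """Flag collapse: a single predicted category dominating all outputs.

    The 0.8 dominance threshold is an arbitrary choice for illustration.
    """
    preds = Counter(pred for _, pred, _ in records)
    top_share = preds.most_common(1)[0][1] / len(records)
    return top_share >= threshold
```

A large `accuracy_gap` across, say, age groups on a speech-transcription task would correspond to the kind of audio-modality bias the paper reports, while `prediction_collapse` returning true on a language-identification task would match the "collapse toward narrow categories" finding.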