T2I-BiasBench: A Multi-Metric Framework for Auditing Demographic and Cultural Bias in Text-to-Image Models

arXiv cs.CV · April 15, 2026


Key Points

  • The paper introduces T2I-BiasBench, a unified multi-metric framework for auditing text-to-image (T2I) diffusion models for demographic bias, element omission, and cultural collapse simultaneously.
  • The benchmark evaluates three open-source models (Stable Diffusion v1.5, BK-SDM Base, Koala Lightning) against Gemini 2.5 Flash (RLHF-aligned) using 1,574 generated images across five structured prompt categories.
  • T2I-BiasBench uses 13 complementary metrics, including four newly proposed measures (e.g., Composite Bias Score and Cultural Accuracy Ratio) and three adapted metrics to capture different failure modes.
  • The results show bias amplification in beauty-related prompts for Stable Diffusion v1.5 and BK-SDM, while certain contextual constraints (e.g., surgical PPE) can attenuate professional-role gender bias.
  • Cultural coverage gaps persist across all evaluated models: even RLHF alignment does not prevent cultural representation collapse. The benchmark is publicly released for standardized evaluation.
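The bias-amplification finding above (values >1.0 in beauty-related prompts) can be illustrated with a toy computation. The formulation below is a generic illustrative definition (the share of an attribute among generated images divided by its reference share), not necessarily the paper's exact Composite Bias Score:

```python
from collections import Counter

def bias_amplification(generated_labels, reference_share, attribute):
    """Illustrative bias-amplification ratio (hypothetical formulation):
    the share of `attribute` among generated images divided by its
    expected reference share. Values > 1.0 mean the model amplifies
    the skew relative to the reference distribution."""
    counts = Counter(generated_labels)
    generated_share = counts[attribute] / len(generated_labels)
    return generated_share / reference_share

# Toy example: 80 of 100 generated images depict one gender
# where the reference expectation is parity (0.5).
labels = ["female"] * 80 + ["male"] * 20
print(bias_amplification(labels, 0.5, "female"))  # 1.6 -> amplification
```

Under this reading, a score of exactly 1.0 would mean the model reproduces the reference distribution, and values below 1.0 would mean it attenuates the skew.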

Abstract

Text-to-image (T2I) generative models achieve impressive visual fidelity but inherit and amplify demographic imbalances and cultural biases embedded in training data. We introduce T2I-BiasBench, a unified evaluation framework of thirteen complementary metrics that jointly captures demographic bias, element omission, and cultural collapse in diffusion models; it is the first framework to address all three dimensions simultaneously. We evaluate three open-source models (Stable Diffusion v1.5, BK-SDM Base, and Koala Lightning) against Gemini 2.5 Flash (RLHF-aligned) as a reference baseline. The benchmark comprises 1,574 generated images across five structured prompt categories. T2I-BiasBench integrates six established metrics with seven additional measures: four newly proposed (Composite Bias Score, Grounded Missing Rate, Implicit Element Missing Rate, Cultural Accuracy Ratio) and three adapted (Hallucination Score, Vendi Score, CLIP Proxy Score). Three key findings emerge: (1) Stable Diffusion v1.5 and BK-SDM exhibit bias amplification (>1.0) in beauty-related prompts; (2) contextual constraints such as surgical PPE substantially attenuate professional-role gender bias (Doctor CBS = 0.06 for SD v1.5); and (3) all models, including RLHF-aligned Gemini, collapse to a narrow set of cultural representations (CAS: 0.54-1.00), confirming that alignment techniques do not resolve cultural coverage gaps. T2I-BiasBench is publicly released to support standardized, fine-grained bias evaluation of generative models. The project page is available at: https://gyanendrachaubey.github.io/T2I-BiasBench/
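Among the adapted metrics, the Vendi Score is a published diversity measure (Friedman and Dieng, 2022): the exponential of the Shannon entropy of the eigenvalues of a normalized similarity matrix, interpretable as the effective number of distinct items in a sample. A minimal sketch of that standard definition follows; the paper's adaptation for measuring cultural collapse may differ in how similarities are computed:

```python
import numpy as np

def vendi_score(K):
    """Vendi Score of an n x n positive semi-definite similarity
    matrix K with unit diagonal: exp of the Shannon entropy of the
    eigenvalues of K/n. Ranges from 1 (all items identical) to n
    (all items fully distinct)."""
    n = K.shape[0]
    eigvals = np.linalg.eigvalsh(K / n)
    eigvals = eigvals[eigvals > 1e-12]  # drop numerical zeros before log
    return float(np.exp(-np.sum(eigvals * np.log(eigvals))))

# Four identical items collapse to an effective count of 1;
# four orthogonal (fully distinct) items yield an effective count of 4.
print(vendi_score(np.ones((4, 4))))  # ~1.0 (total collapse)
print(vendi_score(np.eye(4)))        # ~4.0 (maximal diversity)
```

In a cultural-collapse audit, a low Vendi Score over images generated for a culturally diverse prompt set would signal that the model keeps producing near-duplicate representations.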