What Are We Really Measuring? Rethinking Dataset Bias in Web-Scale Natural Image Collections via Unsupervised Semantic Clustering

arXiv cs.CV / 4/16/2026


Key Points

  • The paper challenges a common computer-vision practice of measuring dataset bias by training a classifier to distinguish between datasets, arguing that high accuracy does not necessarily imply semantic differences.
  • It shows that dataset identification is often driven by resolution and resizing artifacts (structural fingerprints) that survive standard corruptions and conventional augmentations.
  • Using controlled experiments, the authors demonstrate that models can still classify datasets even when the images are non-semantic and procedurally generated, indicating reliance on low-level cues.
  • To measure semantic separability more faithfully, they propose an unsupervised framework that clusters semantic features from foundation vision models instead of using supervised classification on dataset labels.
  • When applied to major web-scale datasets, the previously reported high “separability” largely disappears under semantic clustering, with clustering accuracy dropping to near chance, suggesting semantic bias has been substantially overstated.
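The contrast the key points draw — supervised dataset classification versus unsupervised clustering of semantic features — can be sketched as a toy experiment. This is an illustrative stand-in, not the paper's pipeline: the "features" below are random vectors playing the role of foundation-model embeddings for two datasets whose semantic content comes from the same distribution, and the clustering is a minimal k-means written inline.

```python
import numpy as np

def kmeans(X, k=2, iters=50, seed=0):
    """Minimal k-means; a stand-in for whatever clustering method is used."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(axis=0)
    return assign

def clustering_accuracy(dataset_id, clusters):
    """Best of the two possible cluster-to-dataset matchings (k = 2)."""
    agree = (dataset_id == clusters).mean()
    return max(agree, 1.0 - agree)

rng = np.random.default_rng(0)
# Toy stand-in for embeddings of images from two "datasets" whose
# semantic content is drawn from the SAME distribution (hypothetical data).
features = rng.normal(size=(1000, 64))
dataset_id = rng.integers(0, 2, size=1000)

clusters = kmeans(features, k=2)
acc = clustering_accuracy(dataset_id, clusters)
# With no semantic difference between the two "datasets", the matched
# clustering accuracy stays near the 0.5 chance level.
```

The point of the sketch is the evaluation direction: clusters are formed without ever seeing dataset labels, and only afterwards compared against them, so low-level shortcuts that a supervised classifier could exploit confer no advantage.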

Abstract

In computer vision, a prevailing method for quantifying dataset bias is to train a model to distinguish between datasets. High classification accuracy is then interpreted as evidence of meaningful semantic differences. This approach assumes that standard image augmentations successfully suppress low-level, non-semantic cues, and that any remaining performance must therefore reflect true semantic divergence. We demonstrate that this fundamental assumption is flawed within the domain of large-scale natural image collections. High classification accuracy is often driven by resolution-based artifacts: structural fingerprints arising from native image-resolution distributions and interpolation effects during resizing. These artifacts form robust, dataset-specific signatures that persist despite conventional image corruptions. Through controlled experiments, we show that models achieve strong dataset classification even on non-semantic, procedurally generated images, proving their reliance on superficial cues. To address this issue, we revisit the decades-old idea of dataset separability, but not with supervised classification. Instead, we introduce an unsupervised approach that measures true semantic separability. Our framework directly assesses semantic similarity by clustering semantically rich features from foundation vision models, deliberately bypassing supervised classification on dataset labels. When applied to major web-scale datasets, the primary focus of this work, the high separability reported by supervised methods largely vanishes, with clustering accuracy dropping to near-chance levels. This reveals that conventional classification-based evaluation systematically overstates semantic bias by an overwhelming margin.
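The "structural fingerprint" claim in the abstract can be made concrete with a toy example, separate from the paper's actual experiments. Below, two "datasets" are synthesized from the very same pixel distribution but downsampled differently — one by striding (nearest-neighbor style) and one by 2×2 averaging (a crude bilinear stand-in). A single low-level statistic, with no access to semantic content, separates them perfectly; all names and numbers here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def hf_energy(img):
    # Mean absolute difference between horizontal neighbors: a crude
    # high-frequency statistic that is sensitive to interpolation smoothing.
    return np.abs(np.diff(img, axis=1)).mean()

def sample(resize):
    # Both "datasets" start from the SAME content distribution.
    src = rng.normal(size=(64, 64))
    if resize == "nearest":
        # Striding keeps high frequencies intact.
        return src[::2, ::2]
    # 2x2 block averaging attenuates high frequencies, like bilinear resizing.
    return (src[::2, ::2] + src[1::2, ::2] + src[::2, 1::2] + src[1::2, 1::2]) / 4

a = [hf_energy(sample("nearest")) for _ in range(200)]
b = [hf_energy(sample("average")) for _ in range(200)]

# A single threshold on this one-dimensional statistic separates the two
# "datasets" perfectly, even though their pixel content is identically distributed.
thresh = (np.mean(a) + np.mean(b)) / 2
acc = (np.mean([x > thresh for x in a]) + np.mean([x < thresh for x in b])) / 2
```

This mirrors the abstract's argument in miniature: a dataset classifier can score highly on exactly this kind of resizing fingerprint, so high supervised accuracy by itself says nothing about semantic divergence.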