LatentDiff: Scaling Semantic Dataset Comparison to Millions of Images

arXiv cs.CV / 5/5/2026


Key Points

  • The paper introduces LatentDiff, a framework for comparing visual datasets by working directly in the latent space of pretrained vision encoders rather than using caption-based approaches.
  • LatentDiff combines sparse autoencoder-based divergence testing with density ratio estimation to find interpretable semantic differences between datasets at much lower computational cost.
  • The authors propose Noisy-Diff, a benchmark designed to model realistic sparse distribution shifts that commonly break existing dataset-comparison methods.
  • Experiments indicate that LatentDiff achieves higher accuracy and remains robust even when only a very small fraction of images (from 5% down to below 1%) differ semantically.
  • Overall, the work targets scalable, semantic-level dataset comparison for large-scale image corpora with improved efficiency and robustness.

Abstract

We present LatentDiff, a scalable framework for semantic dataset comparison that operates directly in the latent space of pretrained vision encoders. By combining sparse autoencoder-based divergence testing with density ratio estimation, LatentDiff identifies interpretable semantic differences between datasets at a fraction of the computational cost of caption-based alternatives. We also introduce Noisy-Diff, a benchmark capturing realistic sparse distribution shifts that cause existing methods to struggle. Experiments demonstrate that LatentDiff achieves superior accuracy while remaining robust to settings where an extremely small fraction of images (from 5% to <1%) differ semantically.
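The paper's implementation details are not reproduced here, so the following is a minimal, hypothetical NumPy sketch of the general idea: two sets of encoder latents are mapped to sparse codes (a fixed random dictionary with top-k ReLU stands in for a pretrained sparse autoencoder), and a simple two-proportion z-statistic on per-feature activation rates stands in for the paper's divergence test. All names, shapes, shift magnitudes, and thresholds are illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch of latent-space dataset comparison via sparse codes.
# The random dictionary and z-statistic are stand-ins for LatentDiff's
# actual SAE and divergence test, which are not specified in this summary.
import numpy as np

rng = np.random.default_rng(0)
D_LATENT, D_SPARSE, TOP_K = 64, 256, 8

# Stand-in for latents from a pretrained vision encoder.
# Dataset B carries a sparse shift: ~5% of its images add a fixed direction.
lat_a = rng.normal(size=(2000, D_LATENT))
lat_b = rng.normal(size=(2000, D_LATENT))
shift_dir = rng.normal(size=D_LATENT)
mask = rng.random(2000) < 0.05            # only ~5% of B differs semantically
lat_b[mask] += 3.0 * shift_dir

# Stand-in for a pretrained sparse autoencoder: random dictionary + top-k ReLU.
W = rng.normal(size=(D_LATENT, D_SPARSE)) / np.sqrt(D_LATENT)

def sparse_code(x, k=TOP_K):
    """ReLU activations, keeping (roughly) only the k largest per sample."""
    a = np.maximum(x @ W, 0.0)
    thresh = np.partition(a, -k, axis=1)[:, -k:].min(axis=1, keepdims=True)
    return np.where(a >= thresh, a, 0.0)

codes_a, codes_b = sparse_code(lat_a), sparse_code(lat_b)

# Per-feature divergence test: compare activation rates across datasets
# with a two-proportion z-statistic (an illustrative choice of test).
rate_a = (codes_a > 0).mean(axis=0)
rate_b = (codes_b > 0).mean(axis=0)
pooled = (rate_a + rate_b) / 2
se = np.sqrt(pooled * (1 - pooled) * (2 / len(codes_a))) + 1e-12
z = (rate_b - rate_a) / se
divergent = np.where(np.abs(z) > 4.0)[0]  # sparse features whose usage shifted
print(f"{len(divergent)} of {D_SPARSE} sparse features flagged as divergent")
```

Even with the shift confined to ~5% of one dataset, the few sparse features aligned with the shift direction fire on essentially every shifted image, so their activation rates separate cleanly from the baseline; this is the intuition behind testing divergence feature-by-feature in a sparse basis rather than comparing raw latent densities.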