Fair Dataset Distillation via Cross-Group Barycenter Alignment

arXiv cs.AI / 5/4/2026


Key Points

  • The paper studies dataset distillation and finds that, because demographic groups exhibit distinct predictive patterns, distillation struggles to preserve useful signals for all subgroups at once.
  • It shows that performance losses for some subgroups (and the resulting fairness gaps) can occur regardless of whether group sizes are mildly or severely imbalanced.
  • The authors argue these fairness gaps are not fixed simply by correcting group imbalance, because they arise from fundamental mismatches in subgroup predictive patterns rather than from sample-size effects.
  • They propose a formal solution based on finding a group-imbalance-agnostic “barycenter” of predictive information, then distilling toward a shared aggregate representation across subgroups.
  • Experiments indicate the method is compatible with existing distillation approaches and substantially reduces bias introduced by dataset distillation.

Abstract

Dataset Distillation aims to compress a large dataset into a small synthetic one while maintaining predictive performance. We show that as different demographic groups exhibit distinct predictive patterns, the distillation process struggles to simultaneously preserve informative signals for all subgroups, regardless of whether group sizes are mildly or severely imbalanced. Consequently, models trained on distilled data can experience substantial performance drops for certain subgroups, leading to fairness gaps. Crucially, these gaps do not disappear by merely correcting group imbalance, since they stem from fundamental mismatches in subgroup predictive patterns rather than from sample-size disparities alone. We therefore formally analyze the interaction between these two sources of bias and cast the solution as identifying a group-imbalance-agnostic barycenter of the predictive information that induces similar representations across all subgroups. By distilling toward this shared aggregate representation, we show that group fairness concerns can be reduced. Our approach is compatible with existing distillation methods, and empirical results show that it substantially reduces bias introduced by dataset distillation.
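The key construction in the abstract is a group-imbalance-agnostic "barycenter" of predictive information that the distilled data is aligned toward. Below is a minimal illustrative sketch of that idea, not the paper's actual method: it takes the unweighted mean of per-group mean representations (so the target is independent of group sizes) and defines an alignment loss against it. All function names (`group_barycenter`, `alignment_loss`) and the choice of a simple Euclidean barycenter over feature means are assumptions for illustration.

```python
# Hedged sketch of a group-imbalance-agnostic barycenter target.
# NOTE: names and the Euclidean-mean barycenter are illustrative assumptions,
# not the paper's implementation.
import numpy as np

def group_barycenter(features, groups):
    """Unweighted mean of per-group mean representations.

    Averaging per-group means (rather than pooling all samples) makes the
    target independent of how many samples each group contributes,
    i.e. group-imbalance-agnostic.
    """
    group_ids = np.unique(groups)
    group_means = np.stack(
        [features[groups == g].mean(axis=0) for g in group_ids]
    )
    return group_means.mean(axis=0)

def alignment_loss(synthetic_features, barycenter):
    """Squared distance between the synthetic set's mean representation and
    the cross-group barycenter; in practice this would be added to a base
    distillation objective."""
    diff = synthetic_features.mean(axis=0) - barycenter
    return float((diff ** 2).sum())

# Toy example: group 0 has 90 samples, group 1 has 10 (severe imbalance).
rng = np.random.default_rng(0)
feats = np.concatenate([rng.normal(0.0, 1.0, (90, 4)),
                        rng.normal(5.0, 1.0, (10, 4))])
grps = np.array([0] * 90 + [1] * 10)

bary = group_barycenter(feats, grps)
# The barycenter sits roughly midway between the two group means (~2.5 per
# dimension), not at the size-weighted pooled mean (~0.5), despite the 9:1
# imbalance.
```

The contrast between the barycenter (~2.5) and the pooled mean (~0.5) is the point of the sketch: a target built from pooled data would be dominated by the majority group, while the barycenter weights each subgroup equally regardless of its sample count.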