Rethinking Dataset Distillation: Hard Truths about Soft Labels

arXiv cs.LG / 4/22/2026

Key Points

  • New evidence suggests that, in soft-label downstream training, simple random subsets can match state-of-the-art dataset distillation (DD) methods, undermining assumptions that DD quality improvements always matter.
  • A scalability analysis across abundant soft-label (SL+KD), fixed soft-label (SL), and hard-label (HL) regimes finds that high-quality coresets do not convincingly beat random baselines in the SL and SL+KD regimes, and that in the SL+KD setting performance approaches full-dataset levels for a fixed compute budget, regardless of subset size or quality.
  • The results challenge the common practice of evaluating with soft labels because, unlike in the hard-label setting, subset quality has negligible impact on outcomes under soft-label training.
  • Of the nine DD methods evaluated in the HL setting (five large-scale, four small-scale), only RDED reliably beats random baselines on ImageNet-1K, and it can still trail strong coreset methods because it over-relies on easy sample patches.
  • The paper proposes CAD-Prune, a compute-aware pruning metric that identifies samples of optimal difficulty for a given compute budget, and uses it to build CA2D, a compute-aligned DD method that outperforms existing DD methods on ImageNet-1K at various IPC settings.
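
For readers unfamiliar with the random baseline the key points refer to, the following is a minimal sketch of how a random subset at a given IPC (images per class) might be drawn. The function name `random_ipc_subset`, the class-balanced sampling, and the toy labels are assumptions made for illustration; the summary does not describe the paper's exact sampling procedure.

```python
import random
from collections import defaultdict


def random_ipc_subset(labels, ipc, seed=0):
    """Pick `ipc` random sample indices per class (hypothetical helper).

    Assumes the baseline samples uniformly within each class, which is
    one common way to build a random subset at a fixed IPC budget.
    """
    rng = random.Random(seed)

    # Group dataset indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    chosen = []
    for label in sorted(by_class):
        pool = by_class[label]
        if len(pool) < ipc:
            raise ValueError(f"class {label} has fewer than {ipc} samples")
        # Sample without replacement inside the class.
        chosen.extend(rng.sample(pool, ipc))
    return sorted(chosen)


# Toy dataset: 30 samples spread evenly over 3 classes.
labels = [i % 3 for i in range(30)]
subset = random_ipc_subset(labels, ipc=4)
```

A coreset or DD method would replace the uniform `rng.sample` step with its own selection or synthesis rule; the random version is the reference point that the summarized results compare against.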

Abstract

Despite the perceived success of large-scale dataset distillation (DD) methods, recent evidence finds that simple random image baselines perform on par with state-of-the-art DD methods like SRe2L due to the use of soft labels during downstream model training. This is in contrast with the findings in the coreset literature, where high-quality coresets consistently outperform random subsets in the hard-label (HL) setting. To understand this discrepancy, we perform a detailed scalability analysis to examine the role of data quality under different label regimes, ranging from abundant soft labels (termed the SL+KD regime) to fixed soft labels (SL) and hard labels (HL). Our analysis reveals that high-quality coresets fail to convincingly outperform the random baseline in both the SL and SL+KD regimes. In the SL+KD setting, performance further approaches near-optimal levels relative to the full dataset, regardless of subset size or quality, for a given compute budget. This performance saturation calls into question the widespread practice of using soft labels for model evaluation, where, unlike in the HL setting, subset quality has negligible influence. A subsequent systematic evaluation of five large-scale and four small-scale DD methods in the HL setting reveals that only RDED reliably outperforms random baselines on ImageNet-1K, but can still lag behind strong coreset methods due to its over-reliance on easy sample patches. Based on this, we introduce CAD-Prune, a compute-aware pruning metric that efficiently identifies samples of optimal difficulty for a given compute budget, and use it to develop CA2D, a compute-aligned DD method that outperforms current DD methods on ImageNet-1K at various IPC settings. Together, our findings uncover many insights into current DD research and establish useful tools to advance data-efficient learning for both coresets and DD.
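
To make the hard-label versus soft-label distinction concrete, here is a minimal sketch of the two training targets for a single example. The abstract does not spell out the loss, so the sketch assumes the common formulation: cross-entropy against a one-hot target for HL, and cross-entropy against a teacher's softened output distribution for soft labels. The logits, the temperature value, and the helper names are all illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, student_logits):
    """Cross-entropy between a target distribution and the student's prediction."""
    student_probs = softmax(student_logits)
    return -sum(
        t * math.log(p) for t, p in zip(target_probs, student_probs) if t > 0
    )


def one_hot(label, num_classes):
    """Hard-label target: all probability mass on the ground-truth class."""
    return [1.0 if c == label else 0.0 for c in range(num_classes)]


# Illustrative values only (not taken from the paper).
student_logits = [2.0, 0.5, -1.0]
teacher_logits = [3.0, 2.5, -2.0]

# HL regime: the target carries only the class identity.
hl_loss = cross_entropy(one_hot(0, 3), student_logits)

# Soft-label regimes: the target is the teacher's full distribution,
# which also encodes how similar the teacher finds the other classes.
teacher_probs = softmax(teacher_logits, temperature=2.0)
sl_loss = cross_entropy(teacher_probs, student_logits)
```

Under this reading, which is an assumption, the fixed SL regime would store one teacher distribution per image, while SL+KD would query the teacher again for each augmented view. The teacher then supplies far more supervision than the images alone, which fits the reported finding that subset quality matters little once soft labels are abundant.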