Rethinking Dataset Distillation: Hard Truths about Soft Labels

arXiv cs.LG / 4/22/2026

Opinion | Ideas & Deep Analysis | Models & Research


Key Points

New evidence suggests that, in soft-label downstream training, simple random subsets can match state-of-the-art dataset distillation (DD) methods, undermining assumptions that DD quality improvements always matter.
A scalability analysis across abundant soft-label (SL+KD), fixed soft-label (SL), and hard-label (HL) regimes finds that high-quality coresets do not clearly beat random baselines in SL and SL+KD, and that for a fixed compute budget, performance in the SL+KD setting saturates near full-dataset levels regardless of subset size or quality.
The results challenge the common practice of evaluating with soft labels: unlike in the hard-label setting, subset quality has negligible impact on evaluation outcomes under soft-label training.
In the HL setting, only the RDED DD method consistently beats random baselines on ImageNet-1K, though it can still trail strong coreset approaches due to over-reliance on easy sample patches.
The paper proposes CAD-Prune, a compute-aware pruning metric that identifies samples of optimal difficulty for a given compute budget, and uses it to build CA2D, a compute-aligned DD method that outperforms existing DD methods on ImageNet-1K at various images-per-class (IPC) settings.
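The random baseline mentioned in the key points can be pictured with a short sketch. This is only an illustration of drawing a fixed number of images per class (IPC) uniformly at random; the helper name `random_ipc_subset`, the seed handling, and the toy label list are assumptions made here, not code from the paper.

```python
import random
from collections import defaultdict


def random_ipc_subset(labels, ipc, seed=0):
    """Return sample indices with exactly `ipc` random picks per class.

    Hypothetical helper for illustration; `labels[i]` is the class of sample i.
    """
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    subset = []
    for label in sorted(by_class):
        indices = by_class[label]
        if len(indices) < ipc:
            raise ValueError(f"class {label} has fewer than {ipc} samples")
        # Uniform sampling without replacement inside each class.
        subset.extend(sorted(rng.sample(indices, ipc)))
    return subset


# Toy stand-in for a labeled dataset: 1000 samples spread over 10 classes.
labels = [i % 10 for i in range(1000)]
subset = random_ipc_subset(labels, ipc=5)
```

A coreset or DD method would replace the uniform `rng.sample` call with a quality-driven selection or synthesis step; the baseline keeps everything else identical, so any gap in downstream accuracy can be attributed to data quality.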

Abstract

Despite the perceived success of large-scale dataset distillation (DD) methods, recent evidence finds that simple random image baselines perform on par with state-of-the-art DD methods like SRe2L due to the use of soft labels during downstream model training. This is in contrast with the findings in coreset literature, where high-quality coresets consistently outperform random subsets in the hard-label (HL) setting. To understand this discrepancy, we perform a detailed scalability analysis to examine the role of data quality under different label regimes, ranging from abundant soft labels (termed the SL+KD regime) to fixed soft labels (SL) and hard labels (HL). Our analysis reveals that high-quality coresets fail to convincingly outperform the random baseline in both SL and SL+KD regimes. In the SL+KD setting, performance further approaches near-optimal levels relative to the full dataset, regardless of subset size or quality, for a given compute budget. This performance saturation calls into question the widespread practice of using soft labels for model evaluation, where, unlike the HL setting, subset quality has negligible influence. A subsequent systematic evaluation of five large-scale and four small-scale DD methods in the HL setting reveals that only RDED reliably outperforms random baselines on ImageNet-1K, but can still lag behind strong coreset methods due to its over-reliance on easy sample patches. Based on this, we introduce CAD-Prune, a compute-aware pruning metric that efficiently identifies samples of optimal difficulty for a given compute budget, and use it to develop CA2D, a compute-aligned DD method, outperforming current DD methods on ImageNet-1K at various IPC settings. Together, our findings uncover many insights into current DD research and establish useful tools to advance data-efficient learning for both coresets and DD.
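The difference between hard-label and soft-label training that the abstract turns on can be made concrete with a small sketch of the two loss functions. It assumes the usual formulation, in which a hard label is a class index and a soft label is a probability vector from a pretrained teacher model; the function names and the toy logits are hypothetical, and the paper's actual training recipe may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def hard_label_loss(student_logits, true_class):
    """Cross-entropy against a one-hot (hard) label."""
    probs = softmax(student_logits)
    return -math.log(probs[true_class])


def soft_label_loss(student_logits, teacher_probs):
    """Cross-entropy against a teacher's probability vector (soft label)."""
    probs = softmax(student_logits)
    return -sum(t * math.log(p) for t, p in zip(teacher_probs, probs))


# Toy example with 3 classes (values are illustrative only).
student_logits = [2.0, 0.5, -1.0]
teacher_probs = softmax([3.0, 1.0, -2.0])  # assumed teacher output

hl = hard_label_loss(student_logits, true_class=0)
sl = soft_label_loss(student_logits, teacher_probs)
```

The soft target carries the teacher's full distribution over classes for every image, so much of the training signal comes from the teacher rather than from which images were selected. That is one way to read the abstract's observation that subset quality matters little once soft labels are used.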

