Learnability-Guided Diffusion for Dataset Distillation

arXiv cs.CV / 4/2/2026


Key Points

  • The paper addresses the cost of training on large datasets via dataset distillation: creating a much smaller synthetic dataset on which models can be trained to match the performance of models trained on the full data.
  • It argues that prior diffusion-based distillation methods often generate redundant training signals because they optimize diversity or average training dynamics without explicitly accounting for similarity between distilled samples.
  • The authors propose learnability-driven dataset distillation, an incremental multi-stage curriculum that adds synthetic samples guided by how learnable they are for the current model.
  • They introduce Learnability-Guided Diffusion (LGD), which balances a sample’s training utility for the current model against validity under a reference model to keep generated samples aligned with the intended curriculum.
  • Experiments show a 39.1% reduction in redundancy and state-of-the-art accuracy on ImageNet-1K (60.1%), ImageNette (87.2%), and ImageWoof (72.9%); code is released via the project page.
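The incremental, multi-stage curriculum described above can be sketched as follows. This is a toy illustration, not the paper's implementation: the `learnability` score, the stage counts, and the "model skill" dynamics are all invented assumptions standing in for a trained model and a diffusion generator.

```python
# Hypothetical sketch of learnability-driven dataset distillation.
# All names and dynamics here are illustrative assumptions, not the paper's code.
import random

random.seed(0)

def learnability(model_skill, sample_difficulty):
    # Illustrative score: a sample is most "learnable" when its difficulty
    # sits near the current model's skill level (neither trivial nor hopeless).
    return max(0.0, 1.0 - abs(sample_difficulty - model_skill))

def distill(num_stages=3, per_stage=4):
    # Start from a small synthetic set (difficulties stand in for samples).
    dataset = [random.random() for _ in range(per_stage)]
    model_skill = 0.2  # toy proxy for how much the model has learned
    for _ in range(num_stages):
        # "Train" on the current set: skill grows with dataset size (toy dynamics).
        model_skill = min(0.95, model_skill + 0.1 * len(dataset) / per_stage)
        # Generate candidates, then keep only the most learnable ones,
        # so each stage adds complementary rather than redundant samples.
        candidates = [random.random() for _ in range(per_stage * 4)]
        candidates.sort(key=lambda d: learnability(model_skill, d), reverse=True)
        dataset.extend(candidates[:per_stage])
    return dataset

print(len(distill()))  # initial set plus per_stage new samples each stage -> 16
```

In this sketch the selection step plays the role of the learnability guidance: each stage filters candidates against the current model's state, so the set grows adaptively rather than all at once.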

Abstract

Training machine learning models on massive datasets is expensive and time-consuming. Dataset distillation addresses this by creating a small synthetic dataset that achieves the same performance as the full dataset. Recent methods use diffusion models to generate distilled data, either by promoting diversity or matching training gradients. However, existing approaches produce redundant training signals, where samples convey overlapping information. Empirically, disjoint subsets of distilled datasets capture 80-90% overlapping signals. This redundancy stems from optimizing visual diversity or average training dynamics without accounting for similarity across samples, leading to datasets where multiple samples share similar information rather than complementary knowledge. We propose learnability-driven dataset distillation, which constructs synthetic datasets incrementally through successive stages. Starting from a small set, we train a model and generate new samples guided by learnability scores that identify what the current model can learn from, creating an adaptive curriculum. We introduce Learnability-Guided Diffusion (LGD), which balances training utility for the current model with validity under a reference model to generate curriculum-aligned samples. Our approach reduces redundancy by 39.1%, promotes specialization across training stages, and achieves state-of-the-art results on ImageNet-1K (60.1%), ImageNette (87.2%), and ImageWoof (72.9%). Our code is available on our project page https://jachansantiago.github.io/learnability-guided-distillation/.
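The balance between training utility for the current model and validity under a reference model can be illustrated as a weighted combination of two guidance signals, loosely analogous to classifier-guided diffusion sampling. Everything below is an assumption for illustration: the quadratic "models", the gradients, and the weight `lam` are stand-ins for the paper's actual utility and validity terms.

```python
# Toy sketch of balancing utility vs. validity guidance; not the paper's method.
import numpy as np

def utility_grad(x, current_w):
    # Illustrative "utility": push the sample away from what the current
    # model has already fit (its weight vector), toward novel information.
    return x - current_w

def validity_grad(x, reference_w):
    # Illustrative "validity": pull the sample toward what a fixed reference
    # model deems in-distribution, so guidance stays curriculum-aligned.
    return reference_w - x

def guided_step(x, current_w, reference_w, lam=0.3, lr=0.1):
    # lam trades utility (novelty for the current model) against validity
    # (staying plausible under the reference model).
    g = lam * utility_grad(x, current_w) + (1.0 - lam) * validity_grad(x, reference_w)
    return x + lr * g

# Usage: with lam small, steps move the sample toward the reference model.
x = np.zeros(3)
current_w = np.ones(3)
reference_w = np.full(3, 2.0)
x_next = guided_step(x, current_w, reference_w, lam=0.0)
print(np.linalg.norm(x_next - reference_w) < np.linalg.norm(x - reference_w))
```

The design point the sketch captures is that neither signal alone suffices: pure utility drifts off-distribution, while pure validity regenerates samples the model already knows, which is the redundancy the paper measures.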