Data Balancing Strategies: A Systematic Survey of Resampling and Augmentation Methods

arXiv stat.ML · April 30, 2026


Key Points

  • The paper reviews how imbalanced datasets bias predictions toward majority classes and degrade classifier performance, and it frames data balancing as a broad, context-dependent problem rather than one with a single fix.
  • It systematically catalogs resampling and augmentation approaches, ranging from classic SMOTE and its variants to adaptive methods, deep generative models (GANs, VAEs, diffusion), undersampling, hybrid/combination methods, ensemble strategies, and techniques for multi-label and clustered data.
  • The review evaluates each method by its assumptions, operational mechanism, and fit for different data conditions such as high dimensionality, mixed feature types, class overlap, and noise.
  • A central conclusion is that no single balancing technique universally dominates; the best choice depends on dataset characteristics, the downstream classifier, and the evaluation metrics.
  • It also outlines forward-looking research directions, including self-supervised approaches, diffusion-based generative oversampling, distribution-preserving resampling, knowledge distillation for deployment under imbalance, and adapting foundation models to skewed data distributions.
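To make the core oversampling idea behind SMOTE concrete, here is a minimal, self-contained sketch of its interpolation step: each synthetic point is drawn on the line segment between a minority sample and one of its k nearest minority-class neighbors. This is an illustrative simplification, not the paper's code; the function name `smote_sample` and its parameters are hypothetical, and real implementations (e.g. in imbalanced-learn) add many refinements covered by the variants the paper surveys.

```python
import numpy as np

def smote_sample(X_min, n_synthetic, k=5, rng=None):
    """Sketch of SMOTE's core step: interpolate between a minority
    point and one of its k nearest minority-class neighbors.
    (Hypothetical helper, not the paper's implementation.)"""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbor
    # Indices of the k nearest minority neighbors for each point.
    nn = np.argsort(d, axis=1)[:, :k]
    # Pick a random base point and a random one of its neighbors.
    base = rng.integers(0, n, size=n_synthetic)
    neigh = nn[base, rng.integers(0, k, size=n_synthetic)]
    # Each synthetic point lies on the segment between base and neighbor.
    gap = rng.random((n_synthetic, 1))
    return X_min[base] + gap * (X_min[neigh] - X_min[base])
```

Because every synthetic point is a convex combination of two minority samples, it stays inside the minority class's convex hull; variants such as Borderline-SMOTE change only *which* base points and neighbors get selected.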

Abstract

Imbalanced datasets, where one class significantly outnumbers others, remain a persistent challenge in machine learning, often biasing predictions toward the majority class and degrading classifier performance. This paper provides a comprehensive, systematic review of data balancing methods, extending beyond foundational oversampling techniques such as the Synthetic Minority Oversampling Technique (SMOTE) and its variants (e.g., Borderline SMOTE, K-Means SMOTE, and Safe-Level SMOTE) to encompass advanced adaptive methods (MWMOTE, AMDO), deep generative models (generative adversarial networks, variational autoencoders, and diffusion models), undersampling techniques (NearMiss, Tomek Links), combination/hybrid methods (SMOTE-ENN, SMOTE-Tomek, and SMOTE+OCSVM), ensemble strategies (SMOTEBoost, RUSBoost, Balanced Random Forest, and One-Sided Selection), and specialized approaches for multi-label and clustered data. Beyond descriptive categorization, this review critically examines each method's underlying assumptions, operational mechanisms, and suitability for diverse data characteristics, including high dimensionality, mixed feature types, class overlap, and noise. Key findings demonstrate that no single method universally outperforms others; optimal selection depends critically on dataset characteristics, classifier choice, and evaluation metrics. The paper concludes by identifying emerging research directions, including self-supervised learning for imbalance, diffusion-based generative oversampling, distribution-preserving resampling, knowledge distillation for imbalanced deployment, and the adaptation of foundation models to skewed distributions, offering practical guidelines for practitioners and a roadmap for future methodological development.
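As a small illustration of the undersampling side surveyed above, the sketch below detects Tomek links: cross-class pairs of points that are each other's nearest neighbor, which typically sit on the class boundary or are noise. Removing the majority-class member of each link is the classic cleaning step used alone or inside hybrids like SMOTE-Tomek. This is a simplified assumption-laden sketch (brute-force distances, Euclidean metric, single majority label), not the paper's or any library's implementation.

```python
import numpy as np

def tomek_links(X, y, majority_label):
    """Sketch of Tomek-link undersampling: find mutual-nearest-neighbor
    pairs with opposite labels and drop the majority-class member.
    (Hypothetical helper for illustration only.)"""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude self-distances
    nn = d.argmin(axis=1)                # nearest neighbor of each point
    drop = set()
    for i, j in enumerate(nn):
        # A Tomek link: i and j are mutual nearest neighbors with
        # different labels; remove whichever belongs to the majority.
        if nn[j] == i and y[i] != y[j]:
            drop.add(i if y[i] == majority_label else j)
    keep = np.array([i for i in range(len(y)) if i not in drop])
    return X[keep], y[keep]
```

Note how this mirrors the review's central point: the technique embeds an assumption (boundary pairs are undesirable majority noise) that pays off under class overlap but can discard informative samples in clean, well-separated data.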