Data Balancing Strategies: A Systematic Survey of Resampling and Augmentation Methods
arXiv stat.ML · April 30, 2026
Key Points
- The paper reviews how imbalanced datasets bias predictions toward majority classes and degrade classifier performance, framing data balancing as a broad, method-dependent problem rather than a single fixed recipe.
- It systematically catalogs resampling and augmentation approaches, ranging from classic SMOTE and its variants to adaptive methods, deep generative models (GANs, VAEs, diffusion), undersampling, hybrid/combination methods, ensemble strategies, and techniques for multi-label and clustered data.
- The review evaluates each method by its assumptions, operational mechanism, and fit for different data conditions such as high dimensionality, mixed feature types, class overlap, and noise.
- A central conclusion is that no single balancing technique universally dominates; the best choice depends on dataset characteristics, the downstream classifier, and the evaluation metrics.
- It also outlines forward-looking research directions, including self-supervised approaches, diffusion-based generative oversampling, distribution-preserving resampling, knowledge distillation for deployment under imbalance, and adapting foundation models to skewed data distributions.
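To make the oversampling family named above concrete, here is a minimal, illustrative SMOTE-style sketch (an assumption for exposition, not the survey's or the reference library's implementation): each synthetic minority point is an interpolation between a random minority sample and one of its k nearest minority neighbors.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, rng=None):
    """Hypothetical minimal SMOTE-style oversampler.

    For each of n_new synthetic points: pick a random minority sample,
    find its k nearest minority neighbors (Euclidean), choose one
    neighbor at random, and interpolate between the two.
    """
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from sample i to every minority point
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        # Indices of the k nearest neighbors, skipping the point itself
        nn = np.argsort(d)[1:k + 1]
        j = rng.choice(nn)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)
```

Because every synthetic point lies on a segment between two real minority samples, the sketch also shows why vanilla SMOTE can amplify class overlap and noise, which is exactly the failure mode the adaptive and hybrid variants cataloged by the survey try to address.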