Improving Machine Learning Performance with Synthetic Augmentation

arXiv cs.AI / 4/17/2026


Key Points

  • The paper reframes synthetic augmentation in machine learning as altering the effective training distribution, clarifying how this changes the underlying bias–variance trade-off.
  • It argues that synthetic data can reduce estimation error by adding samples, but it can also worsen population-level objectives when the synthetic distribution differs from what matters at evaluation time.
  • To separate true informational gains from pure sample-size effects, the authors propose a size-matched null augmentation and a finite-sample, non-parametric block permutation test that stays valid under weak temporal dependence.
  • Experiments across multiple synthetic-data generators (from bootstrap and copula models to VAEs, diffusion models, and TimeGAN) show augmentation is helpful mainly in variance-dominant regimes, while it can hurt in bias-dominant tasks; rare-regime targeting may improve some domain metrics but can undermine unconditional permutation-based inference.
  • The study evaluates both controlled Markov-switching simulations and real financial datasets (high-frequency options trading and daily equity panels), offering a structural guide to when synthetic augmentation will improve versus distort financial model performance.
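The block permutation test mentioned above can be sketched concretely. The following is a minimal illustration, not the authors' implementation: it assumes the test operates on a series of per-period loss differentials (augmented model minus size-matched null) and flips signs block-by-block, which preserves short-range temporal dependence within blocks. The function name and block-flipping construction are illustrative assumptions; the paper's exact procedure may differ.

```python
import numpy as np

def block_permutation_pvalue(diff, block_len=10, n_perm=2000, seed=0):
    """Two-sided block permutation test for the mean of a weakly
    dependent series `diff` of per-period loss differentials
    (augmented-model loss minus size-matched-null loss).

    Signs are flipped block-by-block rather than point-by-point,
    so dependence within each block is left intact. Illustrative
    sketch only; not the paper's exact construction.
    """
    rng = np.random.default_rng(seed)
    diff = np.asarray(diff, dtype=float)
    n = len(diff)
    n_blocks = int(np.ceil(n / block_len))
    observed = abs(diff.mean())

    count = 0
    for _ in range(n_perm):
        # Draw one random sign per block, then expand to the series.
        signs = rng.choice([-1.0, 1.0], size=n_blocks)
        flipped = diff * np.repeat(signs, block_len)[:n]
        if abs(flipped.mean()) >= observed:
            count += 1
    # Add-one correction keeps the p-value strictly positive.
    return (count + 1) / (n_perm + 1)
```

A small p-value indicates the augmented model's loss differs from the size-matched null by more than block-wise sign flips can explain, i.e. a genuine informational effect rather than a sample-size artifact.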

Abstract

Synthetic augmentation is increasingly used to mitigate data scarcity in financial machine learning, yet its statistical role remains poorly understood. We formalize synthetic augmentation as a modification of the effective training distribution and show that it induces a structural bias–variance trade-off: while additional samples may reduce estimation error, they may also shift the population objective whenever the synthetic distribution deviates from regions relevant under evaluation. To isolate informational gains from mechanical sample-size effects, we introduce a size-matched null augmentation and a finite-sample, non-parametric block permutation test that remains valid under weak temporal dependence. We evaluate this framework in both controlled Markov-switching environments and real financial datasets, including high-frequency option trade data and a daily equity panel. Across generators spanning bootstrap, copula-based models, variational autoencoders, diffusion models, and TimeGAN, we vary augmentation ratio, model capacity, task type, regime rarity, and signal-to-noise ratio. We show that synthetic augmentation is beneficial only in variance-dominant regimes, such as persistent volatility forecasting, while it deteriorates performance in bias-dominant settings, including near-efficient directional prediction. Rare-regime targeting can improve domain-specific metrics but may conflict with unconditional permutation inference. Our results provide a structural perspective on when synthetic data improves financial learning performance and when it induces persistent distributional distortion.
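The abstract's "size-matched null augmentation" can be illustrated with a short sketch. The abstract does not specify the null's construction, so the version below is one plausible instantiation (an assumption, not the paper's definition): instead of appending synthetic rows, it appends the same number of rows resampled with replacement from the real data. Training on this null matches the augmented sample size while adding no new information, so any remaining performance gap relative to genuine augmentation can be attributed to the generator rather than to sample size alone.

```python
import numpy as np

def size_matched_null(real_X, real_y, synth_size, seed=0):
    """Hypothetical size-matched null augmentation (illustrative).

    Appends `synth_size` rows resampled with replacement from the
    real data, so the null-augmented set has exactly the same size
    as a synthetically augmented set but carries no new information.
    """
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(real_X), size=synth_size)
    X_aug = np.concatenate([real_X, real_X[idx]])
    y_aug = np.concatenate([real_y, real_y[idx]])
    return X_aug, y_aug
```

Comparing a model trained on real + synthetic data against one trained on real + null data (same total size) is what separates informational gains from mechanical sample-size effects.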