Synthesizing real-world distributions from high-dimensional Gaussian Noise with Fully Connected Neural Network

arXiv cs.LG / 4/13/2026


Key Points

  • The paper introduces a fast synthetic data generation approach that uses a fully connected neural network and a randomized loss function to map high-dimensional Gaussian noise into a target real-world tabular data distribution.
  • Experiments on 25 diverse real-world tabular datasets show the method achieves better distributional similarity than prior state-of-the-art generative approaches while producing results far faster than modern deep learning-based solutions.
  • The study assesses distributional similarity (including MMD) and the impact on downstream classification quality, and it examines PCA-based dimensionality reduction, which the authors report further enhances privacy and can boost classification quality while reducing time and memory complexity.
  • The authors frame the method as supporting key synthetic data goals: performance gains through data augmentation, privacy preservation of the original samples, and reliable method assessment with fully synthetic data.
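
For readers unfamiliar with MMD (maximum mean discrepancy), the sketch below shows one common way to estimate it between a real and a synthetic sample. It is an illustration only: the RBF kernel, the `gamma` value, the biased estimator, and the helper names `rbf_kernel` and `mmd2_biased` are assumptions made here, not details taken from the paper.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """RBF kernel matrix between the rows of a and the rows of b."""
    # Pairwise squared Euclidean distances, clipped at 0 against rounding error.
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2_biased(x, y, gamma=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples x and y."""
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())

# Toy data standing in for real and synthetic tables (hypothetical values).
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(200, 5))
close = rng.normal(0.0, 1.0, size=(200, 5))   # same distribution as `real`
far = rng.normal(2.0, 1.0, size=(200, 5))     # shifted distribution

mmd_close = mmd2_biased(real, close, gamma=0.1)
mmd_far = mmd2_biased(real, far, gamma=0.1)
```

A lower value means the two samples are harder to tell apart under the chosen kernel, so `mmd_close` should come out well below `mmd_far` here.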

Abstract

The use of synthetic data in machine learning applications and research offers many benefits, including performance improvements through data augmentation, privacy preservation of original samples, and reliable method assessment with fully synthetic data. This work proposes a time-efficient synthetic data generation method based on a fully connected neural network and a randomized loss function that transforms a random Gaussian distribution to approximate a target real-world dataset. The experiments conducted on 25 diverse tabular real-world datasets confirm that the proposed solution surpasses the state-of-the-art generative methods and achieves reference MMD scores orders of magnitude faster than modern deep learning solutions. The experiments involved analyzing distributional similarity, assessing the impact on classification quality, and using PCA for dimensionality reduction, which further enhances data privacy and can boost classification quality while reducing time and memory complexity.
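
The abstract mentions PCA for dimensionality reduction but this summary does not say how it is wired into the pipeline. The sketch below only illustrates the reduction step itself, using an SVD-based PCA; the helper names `pca_fit`, `pca_transform`, and `pca_inverse`, the toy data, and the choice of 3 components are assumptions for illustration.

```python
import numpy as np

def pca_fit(x, k):
    """Return the column means and the top-k principal directions of x."""
    mean = x.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:k]

def pca_transform(x, mean, components):
    """Project samples onto the retained principal directions."""
    return (x - mean) @ components.T

def pca_inverse(z, mean, components):
    """Map reduced samples back to the original feature space."""
    return z @ components + mean

# Toy table with 100 rows and 8 numeric features (hypothetical data).
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 8))

mean, comps = pca_fit(data, k=3)
reduced = pca_transform(data, mean, comps)    # shape (100, 3)
restored = pca_inverse(reduced, mean, comps)  # shape (100, 8), lossy
```

Working in the reduced space means fewer values per sample, which is consistent with the lower time and memory cost the abstract describes. The inverse map is lossy whenever `k` is smaller than the number of features.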