Generative Modeling under Non-Monotonic MAR Missingness via Approximate Wasserstein Gradient Flows

arXiv stat.ML / 4/7/2026


Key Points

  • The paper introduces FLOWGEM, a principled iterative generative method that produces a complete dataset from data with values Missing at Random (MAR), without relying on ad-hoc imputation.
  • FLOWGEM minimizes the expected KL divergence between the observed data distribution and the distribution of the generated sample, with the expectation taken over the different missingness patterns. The approach is motivated by convergence results for the ignoring maximum likelihood estimator, that is, maximum likelihood that ignores the missingness mechanism.
  • To achieve this optimization, the method uses a discretized particle evolution based on Wasserstein Gradient Flows, with the velocity field approximated via a local linear estimator of the density ratio.
  • Simulation studies and real-data benchmarks are reported to show state-of-the-art performance across a range of settings, including the challenging case of non-monotonic MAR mechanisms.
  • Overall, the work positions FLOWGEM as a theoretically grounded and practically competitive alternative to existing imputation approaches, bridging theory and empirical performance.
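The particle evolution described in the key points can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a fully observed toy dataset (no missingness patterns), assumes the KL divergence is taken from the particle distribution to the target, and swaps the paper's local linear density-ratio estimator for a plain Gaussian kernel density estimate. The names `kde_score` and `kl_flow_step`, the step size, and the bandwidth are all hypothetical choices made for illustration.

```python
import numpy as np


def kde_score(query, sample, bandwidth):
    """Gradient of the log of a Gaussian KDE fitted on `sample`, evaluated at `query`."""
    diff = sample[None, :, :] - query[:, None, :]            # (m, n, d): x_j - q_i
    logw = -np.sum(diff ** 2, axis=2) / (2.0 * bandwidth ** 2)
    logw -= logw.max(axis=1, keepdims=True)                   # stabilise the softmax
    w = np.exp(logw)
    w /= w.sum(axis=1, keepdims=True)                         # kernel weights per query point
    return np.einsum("mn,mnd->md", w, diff) / bandwidth ** 2


def kl_flow_step(particles, target, step, bandwidth):
    """One explicit Euler step along -grad log(rho / pi), with rho and pi replaced by KDEs.

    Assumption: KDE plug-in for the density ratio, not the paper's local linear estimator.
    """
    log_ratio_grad = (kde_score(particles, particles, bandwidth)
                      - kde_score(particles, target, bandwidth))
    return particles - step * log_ratio_grad


rng = np.random.default_rng(0)
target = rng.normal(loc=[2.0, -1.0], scale=1.0, size=(200, 2))   # stand-in for observed data
particles = rng.normal(loc=0.0, scale=1.0, size=(200, 2))        # initial particle ensemble
start_gap = np.linalg.norm(particles.mean(axis=0) - target.mean(axis=0))

for _ in range(200):
    particles = kl_flow_step(particles, target, step=0.05, bandwidth=0.5)

end_gap = np.linalg.norm(particles.mean(axis=0) - target.mean(axis=0))
print(f"mean gap before: {start_gap:.2f}, after: {end_gap:.2f}")
```

The attraction term pulls each particle toward nearby target points, while the self-score term pushes particles apart, which keeps the ensemble from collapsing onto a few modes. Handling missing values would additionally require restricting each update to the observed coordinates of each missingness pattern, which this sketch does not attempt.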

Abstract

The prevalence of missing values in data science poses a substantial risk to any further analyses. Despite a wealth of research, principled nonparametric methods to deal with general non-monotone missingness are still scarce. Instead, ad-hoc imputation methods are often used, for which it remains unclear whether the correct distribution can be recovered. In this paper, we propose FLOWGEM, a principled iterative method for generating a complete dataset from a dataset with values Missing at Random (MAR). Motivated by convergence results of the ignoring maximum likelihood estimator, our approach minimizes the expected Kullback-Leibler (KL) divergence between the observed data distribution and the distribution of the generated sample over different missingness patterns. To minimize the KL divergence, we employ a discretized particle evolution of the corresponding Wasserstein Gradient Flow, where the velocity field is approximated using a local linear estimator of the density ratio. This construction yields a data generation scheme that iteratively transports an initial particle ensemble toward the target distribution. Simulation studies and real-data benchmarks demonstrate that FLOWGEM achieves state-of-the-art performance across a range of settings, including the challenging case of non-monotonic MAR mechanisms. Together, these results position FLOWGEM as a principled and practical alternative to existing imputation methods, and a decisive step towards closing the gap between theoretical rigor and empirical performance.
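For readers unfamiliar with the construction, the standard Wasserstein gradient flow of a KL objective and its particle discretization can be written out as follows. This is a generic sketch rather than the paper's exact formulation: it assumes the divergence is taken from the generated distribution \rho to a target \pi, it omits the expectation over missingness patterns, and the symbols \rho_t, \pi, r_t, \hat r_k, and \eta are notation introduced here for illustration.

```latex
\begin{align*}
\mathcal{F}(\rho) &= \mathrm{KL}(\rho \,\|\, \pi)
  = \int \rho(x) \log \frac{\rho(x)}{\pi(x)} \, dx, \\
\partial_t \rho_t &= \nabla \cdot \bigl( \rho_t \, \nabla \log r_t \bigr),
  \qquad r_t = \frac{\rho_t}{\pi}, \\
v_t(x) &= -\nabla \log r_t(x), \\
x_i^{k+1} &= x_i^{k} - \eta \, \nabla \log \hat r_k\bigl(x_i^{k}\bigr),
  \qquad i = 1, \dots, n.
\end{align*}
```

The second line is the continuity equation with velocity field v_t, so the flow depends on the two densities only through their ratio r_t. The last line is an explicit Euler step of size \eta applied to each particle, with \hat r_k an estimate of the density ratio at iteration k; in the paper this estimate comes from a local linear estimator.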