Smooth Flow Matching for Synthesizing Functional Data

arXiv stat.ML / 4/7/2026

Key Points

  • The paper proposes Smooth Flow Matching (SFM), a new generative modeling framework for functional (smooth, continuous-domain) data that targets privacy constraints, sparse/irregular sampling, and non-Gaussianity.
  • SFM uses a copula-based approach to build a smooth, parsimonious generative flow that produces infinite-dimensional functions without requiring Gaussian assumptions or low-rank structure.
  • The method is described as computationally efficient and able to handle irregular observations while guaranteeing smoothness in the generated outputs.
  • Simulation studies show advantages for SFM in both synthetic data quality and computational efficiency.
  • An application to the MIMIC-IV longitudinal EHR database shows that SFM can generate high-quality surrogate clinical trajectories for downstream clinical analytics while limiting exposure of sensitive real data.

Abstract

Functional data, i.e., smooth random functions observed over a continuous domain, are increasingly available in areas such as biomedical research, health informatics, and epidemiology. However, effective statistical analysis for functional data is often hindered by challenges such as privacy constraints, sparse and irregular sampling, infinite-dimensionality, and non-Gaussian structures. To address these challenges, we introduce a novel framework named Smooth Flow Matching (SFM), tailored for generative modeling of functional data that enables statistical analysis without exposing sensitive real data. Under a copula framework, SFM constructs a parsimonious smooth flow to generate infinite-dimensional functional data, free of Gaussianity and low-rank assumptions. It is computationally efficient, handles irregular observations, and guarantees the smoothness of the generated functions, offering a practical and flexible solution in scenarios where existing deep generative methods are not applicable. Through extensive simulation studies, we demonstrate the advantages of SFM in terms of both synthetic data quality and computational efficiency. We then apply SFM to generate clinical trajectory data from the MIMIC-IV patient electronic health records (EHR) longitudinal database. Our analysis showcases the ability of SFM to produce high-quality surrogate data for downstream tasks, highlighting its potential to boost the utility of EHR data for clinical applications.