Cascaded Flow Matching for Heterogeneous Tabular Data with Mixed-Type Features

arXiv stat.ML / 5/4/2026

Key Points

The paper tackles a key limitation in generative modeling for tabular data: diffusion and flow-matching approaches struggle to generate mixed-type features, in which a single feature combines discrete states with an otherwise continuous distribution.
It introduces a cascaded method that first produces a low-resolution table row (categorical features plus coarse categorical representations of numerical features) and then uses this as guidance for a high-resolution flow-matching stage.
The high-resolution model relies on a guided conditional probability path and a data-dependent coupling. Because the low-resolution representation explicitly encodes discrete outcomes such as missing or inflated numerical values, mixed-type features are generated more faithfully.
The authors provide a formal proof that the cascade tightens the transport cost bound, and report empirical gains including a 51.9% improvement in the detection score.
Code is released at https://github.com/muellermarkus/tabcascade, enabling others to reproduce and build on the approach.
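
To make the idea of a "coarse categorical representation of numerical features" concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's encoder: the function name `to_low_resolution`, the choice of quantile bins, and the parameters `inflated_values` and `n_bins` are all hypothetical. The only point it carries over from the summary is that missing and inflated values receive their own discrete codes, separate from the ordinary continuous range.

```python
import numpy as np

def to_low_resolution(x, inflated_values=(0.0,), n_bins=4):
    """Map a numerical column to coarse integer codes (hypothetical scheme).

    Code 0 marks missing values (NaN), codes 1..k mark the k inflated
    values, and the remaining codes are quantile bins of all other values.
    """
    x = np.asarray(x, dtype=float)
    codes = np.empty(x.shape, dtype=int)

    # Missing values get their own discrete state.
    missing = np.isnan(x)
    codes[missing] = 0

    # Each inflated value (for example an excess of exact zeros) gets a state.
    special = missing.copy()
    for k, v in enumerate(inflated_values, start=1):
        hit = ~missing & (x == v)
        codes[hit] = k
        special |= hit

    # Everything else is binned by quantiles of the regular values
    # (an assumption; the source does not say how the bins are chosen).
    regular = ~special
    offset = 1 + len(inflated_values)
    if regular.any():
        edges = np.quantile(x[regular], np.linspace(0, 1, n_bins + 1)[1:-1])
        codes[regular] = offset + np.searchsorted(edges, x[regular], side="right")
    return codes
```

A column such as `[nan, 0.0, 0.0, 1.0, 2.0, 3.0, 4.0]` with `n_bins=2` would yield one code for the missing entry, one shared code for the two zeros, and two bin codes for the remaining values. In a cascade, codes like these would be generated first and then used to condition the high-resolution stage.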

Abstract

Advances in generative modeling have recently been adapted to tabular data containing discrete and continuous features. However, generating mixed-type features that combine discrete states with an otherwise continuous distribution in a single feature remains challenging. We advance the state-of-the-art in diffusion models for tabular data with a cascaded approach. We first generate a low-resolution version of a tabular data row, that is, the collection of the purely categorical features and a coarse categorical representation of numerical features. Next, this information is leveraged in the high-resolution flow matching model via a novel guided conditional probability path and data-dependent coupling. The low-resolution representation of numerical features explicitly accounts for discrete outcomes, such as missing or inflated values, and therewith enables a more faithful generation of mixed-type features. We formally prove that this cascade tightens the transport cost bound. The results indicate that our model generates significantly more realistic samples and captures distributional details more accurately, for example, the detection score improves by 51.9%. Code is available at https://github.com/muellermarkus/tabcascade.
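
The abstract does not spell out the guided conditional probability path or the data-dependent coupling, so neither is reproduced here. For orientation only, the sketch below shows the standard, unguided baseline they presumably modify: a linear conditional path between a source sample `x0` and a data sample `x1`, with the constant target velocity `x1 - x0`. The function names `linear_path_pair` and `cfm_loss` are hypothetical, and pairing `x0` with `x1` independently is an assumption of this baseline, not the paper's coupling.

```python
import numpy as np

def linear_path_pair(x0, x1, t):
    """Return (x_t, u_t) for the linear conditional path.

    x_t = (1 - t) * x0 + t * x1 interpolates from source to data, and
    u_t = x1 - x0 is the velocity a flow-matching model regresses onto.
    """
    x0 = np.asarray(x0, dtype=float)
    x1 = np.asarray(x1, dtype=float)
    t = np.asarray(t, dtype=float).reshape(-1, 1)  # one time per row
    x_t = (1.0 - t) * x0 + t * x1
    u_t = x1 - x0
    return x_t, u_t

def cfm_loss(v_pred, u_t):
    """Mean squared error between predicted and target velocities."""
    diff = np.asarray(v_pred, dtype=float) - np.asarray(u_t, dtype=float)
    return float(np.mean(np.sum(diff ** 2, axis=1)))
```

In training, `t` would be drawn uniformly from [0, 1], a network would predict a velocity at `(x_t, t)`, and `cfm_loss` would compare that prediction with `u_t`. A cascaded model in the spirit of the abstract would additionally condition on the low-resolution row and choose `x0` in a data-dependent way, details that only the paper and its repository specify.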