TabSCM: A Practical Framework for Generating Realistic Tabular Data

arXiv cs.LG / 4/27/2026


Key Points

TabSCM is a tabular data generation framework designed to preserve causal structure, not just marginal statistics, to reduce spurious or unfair patterns learned by downstream models.
It builds a causal DAG from a CPDAG obtained via causal structure discovery, then models root-node marginals and generates child-node values using conditional diffusion models (continuous) and gradient-boosted trees (categorical).
The method uses ancestral sampling to produce semantically valid synthetic records and supports exact counterfactual queries and robust conditional interventions.
Across seven public datasets (healthcare, finance, housing, and environment), TabSCM matches or outperforms prior GAN, diffusion, and LLM baselines on statistical fidelity, downstream utility, and privacy risk, while lowering rule-violation rates.
Because generation is expressed as explicit equations, TabSCM can be up to 583× faster than diffusion-only approaches and provides interpretable controls for fairness auditing and policy simulation.
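The ancestral sampling and intervention idea from the points above can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in: the three-node graph, the variable names, and the hand-written mechanisms, which take the place of the fitted diffusion models and gradient-boosted trees. It is not TabSCM's actual code, only an illustration of sampling parents before children and overriding a node to simulate an intervention.

```python
import random
from graphlib import TopologicalSorter

# Hypothetical toy DAG, written as node -> list of parents.
PARENTS = {
    "age": [],
    "income": ["age"],
    "approved": ["age", "income"],
}

# Hand-written stand-ins for learned structural assignments (assumption:
# the real framework would use fitted models here, not fixed formulas).
def sample_age(parents, rng):
    return rng.randint(18, 80)

def sample_income(parents, rng):
    return 1000.0 * parents["age"] + rng.gauss(0.0, 5000.0)

def sample_approved(parents, rng):
    return int(parents["income"] > 40000.0)

MECHANISMS = {
    "age": sample_age,
    "income": sample_income,
    "approved": sample_approved,
}

def ancestral_sample(parents_of, mechanisms, rng, do=None):
    """Draw one record by visiting nodes in topological order.

    `do` maps node names to fixed values; an intervened node ignores its
    parents, and its descendants are generated from the forced value.
    """
    do = do or {}
    record = {}
    # TopologicalSorter expects node -> predecessors, so parents come first.
    for node in TopologicalSorter(parents_of).static_order():
        if node in do:
            record[node] = do[node]
        else:
            parent_values = {p: record[p] for p in parents_of[node]}
            record[node] = mechanisms[node](parent_values, rng)
    return record

rng = random.Random(0)
observed = ancestral_sample(PARENTS, MECHANISMS, rng)
intervened = ancestral_sample(PARENTS, MECHANISMS, rng, do={"income": 0.0})
print(observed)
print(intervened)
```

Because every child is generated from already-sampled parent values, a record cannot contain a child value that contradicts its parents under the chosen mechanisms, which is one plausible reading of why this style of generation lowers rule-violation rates.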

Abstract

Most tabular-data generators match marginal statistics yet ignore causal structure, leading downstream models to learn spurious or unfair patterns. We present TabSCM, a mixed-type generator that preserves these causal dependencies. Starting from a Completed Partially Directed Acyclic Graph (CPDAG) found by any causal structure discovery algorithm, TabSCM (i) orients edges to a DAG, (ii) fits root-node marginals with KDE or categorical frequencies, and (iii) learns topologically ordered structural assignments. These assignments use conditional diffusion models for continuous child nodes and gradient-boosted trees for categorical ones. Ancestral sampling yields semantically valid records and enables exact counterfactual queries. On seven public datasets spanning healthcare, finance, housing, and environment, TabSCM matches or surpasses state-of-the-art GAN, diffusion, and LLM baselines in statistical fidelity, downstream utility, and privacy risk, while also cutting rule-violation rates and providing causally meaningful and robust conditional interventions. Because generation is decomposed into explicit equations, it runs up to 583× faster than diffusion-only models and exposes interpretable knobs for fairness auditing and policy simulation, making TabSCM a practical choice for realism, explainability, and causal soundness.
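Step (ii), fitting root-node marginals, can be sketched with the standard library alone. The example below is a rough illustration under stated assumptions: the Gaussian kernel, the normal-reference rule-of-thumb bandwidth, the helper names `fit_kde_sampler` and `fit_categorical_sampler`, and the toy `ages` and `regions` columns are all hypothetical choices, since the abstract names only "KDE or categorical frequencies" without further detail.

```python
import random
import statistics
from collections import Counter

def fit_kde_sampler(values, rng):
    """Return a sampler for a 1-D Gaussian KDE fitted to `values`."""
    n = len(values)
    sigma = statistics.stdev(values)
    # Normal-reference rule-of-thumb bandwidth (an assumption; the source
    # does not say which kernel or bandwidth is used).
    bandwidth = 1.06 * sigma * n ** (-1 / 5)

    def sample():
        # Sampling from a Gaussian KDE: pick a data point uniformly at
        # random, then add Gaussian noise with the kernel bandwidth.
        return rng.choice(values) + rng.gauss(0.0, bandwidth)

    return sample

def fit_categorical_sampler(values, rng):
    """Return a sampler that reproduces the empirical category frequencies."""
    counts = Counter(values)
    categories = list(counts)
    weights = [counts[c] for c in categories]

    def sample():
        return rng.choices(categories, weights=weights, k=1)[0]

    return sample

# Hypothetical root columns of a small training table.
rng = random.Random(0)
ages = [23.0, 35.0, 41.0, 52.0, 67.0]
regions = ["north", "south", "south", "west", "south"]

sample_age = fit_kde_sampler(ages, rng)
sample_region = fit_categorical_sampler(regions, rng)

synthetic_roots = [(sample_age(), sample_region()) for _ in range(1000)]
print(synthetic_roots[:3])
```

Root nodes have no parents, so unconditional samplers like these would be enough to start the chain; every non-root node would then be drawn conditionally on its sampled parents in topological order.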
