Can synthetic data reproduce real-world findings in epidemiology? A replication study using adversarial random forests
arXiv stat.ML / 2026/3/24
Key points
- The paper examines whether synthetic tabular data can reproduce epidemiology study findings while also preserving privacy, addressing gaps in prior evaluation methods.
- It proposes adversarial random forests (ARF) as an efficient approach, accessible to non-experts, for synthesizing epidemiological datasets.
- Using replications of analyses from six epidemiological publications across multiple real-world cohorts/registries, the authors report that ARF-generated synthetic data consistently reproduced both descriptive and inferential results.
- The study finds that lower dimensionality and simpler variables improve synthetic data quality, and that ARF performs favorably versus common tabular data synthesizers on utility, privacy, generalisation, and runtime.
- The work also highlights that many existing synthetic-data evaluations may not adequately capture statistical utility and privacy risk, motivating more directly relevant assessment practices.
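
To make "reproducing an inferential result" concrete, here is a minimal sketch of one common utility check: the overlap between the confidence interval of an estimate from the real data and the interval from the synthetic data. This is an assumption for illustration, not necessarily the metric the paper uses. The function name `ci_overlap` and the example intervals are hypothetical and are not taken from the study.

```python
def ci_overlap(real_ci, synth_ci):
    """Average relative overlap of two confidence intervals.

    Each argument is a (lower, upper) tuple. The result is 1.0 for
    identical intervals and negative when the intervals are disjoint.
    """
    lo_r, hi_r = real_ci
    lo_s, hi_s = synth_ci
    if hi_r <= lo_r or hi_s <= lo_s:
        raise ValueError("each interval must satisfy lower < upper")
    # Length of the intersection (negative if the intervals do not meet).
    inter = min(hi_r, hi_s) - max(lo_r, lo_s)
    # Average the intersection's share of each interval's own width.
    return 0.5 * (inter / (hi_r - lo_r) + inter / (hi_s - lo_s))


# Hypothetical intervals for the same coefficient, for illustration only.
real = (0.0, 2.0)    # estimated on the real cohort
synth = (1.0, 3.0)   # estimated on the synthetic data
score = ci_overlap(real, synth)
```

A score near 1 suggests the synthetic data would lead to much the same inference for that estimate, while a score at or below 0 means the two intervals do not overlap at all.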

