Differentially Private Linear Regression and Synthetic Data Generation with Statistical Guarantees
arXiv stat.ML · March 31, 2026
Key Points
- The paper targets privacy-aware social-science workloads by extending differentially private (DP) linear regression from point estimation to uncertainty quantification via statistically valid inference under Gaussian DP.
- It introduces a bias-corrected estimator that supports asymptotic confidence intervals, enabling researchers to report uncertainty in DP regression outputs.
- The authors also propose a DP synthetic data generation (SDG) procedure constructed so that regression run on the synthetic data reproduces the output of the proposed DP linear regression procedure.
- Experiments indicate the method improves accuracy, yields valid confidence intervals, and produces synthetic data that is more reliable for downstream statistical analyses and machine learning than existing DP synthesizers.
- The approach is positioned as effective for small- to moderate-dimensional settings, aligning with common dataset sizes in the social sciences.
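The paper's exact estimator is not reproduced in this summary, but the general recipe behind DP linear regression with Gaussian noise can be sketched as below. This is a minimal illustration assuming the common sufficient-statistics perturbation approach (clip each record, add Gaussian noise to X&#x2A09;X and X&#x2A09;y, then solve); the paper's bias correction and its asymptotically valid confidence intervals are not implemented here, and `noise_scale` is a stand-in for whatever scale the chosen Gaussian DP budget implies.

```python
import numpy as np

def dp_linear_regression(X, y, noise_scale, clip=1.0, rng=None):
    """Illustrative Gaussian-mechanism OLS via noisy sufficient statistics.

    NOT the paper's estimator: no bias correction, no inference. Noise is
    added once to the clipped statistics X^T X and X^T y, so the solve step
    touches the private data only through those two perturbed quantities.
    """
    rng = np.random.default_rng(rng)
    n, d = X.shape
    # Clip each (x_i, y_i) row so one record's contribution is bounded,
    # which bounds the sensitivity of the sufficient statistics.
    norms = np.linalg.norm(np.hstack([X, y[:, None]]), axis=1)
    scale = np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    Xc, yc = X * scale[:, None], y * scale
    # Symmetric Gaussian noise on X^T X (it is a symmetric matrix).
    noise_xx = rng.normal(0.0, noise_scale, (d, d))
    noise_xx = (noise_xx + noise_xx.T) / np.sqrt(2.0)
    Sxx = Xc.T @ Xc + noise_xx
    Sxy = Xc.T @ yc + rng.normal(0.0, noise_scale, d)
    # Noisy normal equations; bias from clipping and from the noisy
    # inverse is what a bias-corrected estimator would address.
    return np.linalg.solve(Sxx, Sxy)
```

With a generous clipping bound and noise that is small relative to n, the noisy solution tracks ordinary least squares; the interesting regime in the paper is when the privacy noise is not negligible and naive plug-in inference would be invalid.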