DP-CDA: An Algorithm for Enhanced Privacy Preservation in Dataset Synthesis Through Randomized Mixing

arXiv stat.ML / 4/30/2026

Key Points

  • The paper proposes DP-CDA, a data publishing algorithm that generates synthetic datasets by randomly mixing privacy-sensitive data in a class-specific way to reduce re-identification risks.
  • It adds carefully tuned randomness to the mixtures to obtain formal privacy guarantees, and the authors' privacy accounting indicates a stronger guarantee than existing methods provide.
  • The authors evaluate utility by training predictive models on the synthetic data and show that DP-CDA can deliver better accuracy under the same privacy constraints.
  • They also identify an optimal mixing order that balances the privacy–utility trade-off. The abstract notes that many existing methods struggle with high-dimensional data.
  • Overall, DP-CDA aims to maintain strict privacy while improving the practical usefulness of synthetic data for downstream machine learning tasks.
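
The class-specific mixing described in the points above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `mix_class_specific`, the unit-norm clipping step, the use of Gaussian noise, and the parameters `order`, `n_synth_per_class`, and `noise_std` are all assumptions made for illustration, and `noise_std` is not calibrated to any particular privacy budget here.

```python
import numpy as np

def mix_class_specific(X, y, order, n_synth_per_class, noise_std, rng=None):
    """Hypothetical sketch: average `order` random same-class rows, then add noise."""
    rng = np.random.default_rng(rng)
    # Assumption: clip every row to L2 norm <= 1 so one record has bounded influence.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.maximum(norms, 1.0)

    synth_X, synth_y = [], []
    for label in np.unique(y):
        rows = X[y == label]
        if len(rows) < order:
            raise ValueError(f"class {label!r} has fewer than {order} samples")
        for _ in range(n_synth_per_class):
            # Pick `order` distinct rows of this class uniformly at random.
            idx = rng.choice(len(rows), size=order, replace=False)
            mixed = rows[idx].mean(axis=0)
            # Assumption: isotropic Gaussian noise stands in for the "tuned randomness".
            mixed = mixed + rng.normal(0.0, noise_std, size=mixed.shape)
            synth_X.append(mixed)
            synth_y.append(label)
    return np.asarray(synth_X), np.asarray(synth_y)

if __name__ == "__main__":
    data_rng = np.random.default_rng(0)
    X = data_rng.normal(size=(40, 5))
    y = np.array([0] * 20 + [1] * 20)
    Xs, ys = mix_class_specific(X, y, order=4, n_synth_per_class=10,
                                noise_std=0.1, rng=1)
    print(Xs.shape, ys.shape)
```

Because every synthetic row is built from rows of a single class, the label can be carried over unchanged, which is what lets a downstream classifier be trained directly on the synthetic set.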

Abstract

In recent years, the growth of data across various sectors, including healthcare, security, finance, and education, has created significant opportunities for analysis and informed decision-making. However, these datasets often contain sensitive and personal information, which raises serious privacy concerns. It has been shown in multiple works that a person's identity is intertwined with their data, even if the data is anonymized. Due to this lack of separation between a person's identity and their information, the patterns associated with an individual's information can uniquely identify them. Protecting individual privacy is crucial, yet many existing machine learning and data publishing algorithms struggle with high-dimensional data, facing challenges related to the trade-off between computational efficiency and privacy. To address these challenges, we introduce an effective data publishing algorithm, DP-CDA. Our proposed algorithm generates synthetic data by randomly mixing the privacy-sensitive data in a class-specific manner and inducing carefully tuned randomness to ensure formal privacy guarantees. Our comprehensive privacy accounting shows that the proposed DP-CDA provides a stronger privacy guarantee compared to existing methods, allowing for better utility while maintaining a stricter level of privacy. To evaluate the effectiveness of DP-CDA, we examine the accuracy of predictive models trained on the synthetic data, which serves as a measure of dataset utility. Importantly, we identify an optimal order of mixing that balances the privacy-utility trade-off. Our results indicate that synthetic datasets produced using DP-CDA can achieve superior utility compared to those generated by conventional data publishing algorithms, even when subject to the same privacy requirements.
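
One way to see why an optimal order of mixing might exist is to look at how much noise a single averaged record would need under the classical Gaussian mechanism. The sketch below is an illustration under stated assumptions, not the paper's privacy accounting: it assumes each record is clipped to L2 norm `clip_norm`, that neighbouring datasets differ by replacing one record (so the mean of `order` records has sensitivity `2 * clip_norm / order`), and that the textbook calibration for epsilon in (0, 1) applies. The function name `gaussian_noise_std` is hypothetical.

```python
import math

def gaussian_noise_std(order, epsilon, delta, clip_norm=1.0):
    """Noise scale for ONE released mean of `order` clipped records (illustrative)."""
    if order < 1:
        raise ValueError("order must be a positive integer")
    if not 0.0 < epsilon < 1.0:
        raise ValueError("the classical calibration assumes 0 < epsilon < 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    # Replacing one of `order` records moves the mean by at most 2 * clip_norm / order.
    sensitivity = 2.0 * clip_norm / order
    # Classical Gaussian mechanism: sigma = sensitivity * sqrt(2 ln(1.25 / delta)) / epsilon.
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

if __name__ == "__main__":
    for order in (1, 4, 16, 64):
        print(order, gaussian_noise_std(order, epsilon=0.5, delta=1e-5))
```

Under these assumptions the required noise shrinks in proportion to 1/order, while averaging more records blurs the individual samples, so utility can be expected to peak at some intermediate order. The paper's own accounting covers the release of many mixtures and may lead to different constants.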