On the Price of Privacy for Language Identification and Generation

arXiv cs.LG / 4/9/2026


Key Points

  • The paper studies the fundamental privacy cost of differentially private (DP) language identification and text generation when learning from sensitive user data.
  • It derives algorithms and matching lower bounds showing that under approximate (ε, δ)-DP with constant ε > 0, the private error rates match the non-private rates for both identification and generation.
  • Under pure ε-DP, the error exponent degrades by a multiplicative factor of min{1, ε}, which the paper shows is tight up to constants, precisely quantifying the accuracy lost to privacy.
  • It finds that generation under pure DP achieves an optimal rate (up to constants) under mild assumptions, indicating the privacy cost can be precisely characterized.
  • Overall, the authors conclude that the “price of privacy” in language learning is surprisingly mild—absent under approximate DP and limited to a min{1, ε} factor under pure DP.
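
The four regimes above can be summarized in the abstract's notation. (This restates the paper's rates; the pure-DP identification line is inferred from the statement that the exponents degrade by a min{1, ε} factor.)

```latex
\begin{align*}
\text{identification, approx.\ } (\varepsilon,\delta)\text{-DP:}
  &\quad \exp(-r(n)), \text{ for any } r(n) = o(n) \\
\text{generation, approx.\ } (\varepsilon,\delta)\text{-DP:}
  &\quad \exp(-\Omega(n)) \\
\text{identification, pure } \varepsilon\text{-DP:}
  &\quad \exp(-\min\{1,\varepsilon\} \cdot r(n)) \\
\text{generation, pure } \varepsilon\text{-DP:}
  &\quad \exp(-\min\{1,\varepsilon\} \cdot \Omega(n))
\end{align*}
```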

Abstract

As large language models (LLMs) are increasingly trained on sensitive user data, understanding the fundamental cost of privacy in language learning becomes essential. We initiate the study of differentially private (DP) language identification and generation in the agnostic statistical setting, establishing algorithms and matching lower bounds that precisely quantify the cost of privacy. For both tasks, approximate (ε, δ)-DP with constant ε > 0 recovers the non-private error rates: exp(−r(n)) for identification (for any r(n) = o(n)) and exp(−Ω(n)) for generation. Under pure ε-DP, the exponents degrade by a multiplicative factor of min{1, ε}, which we show is tight up to constants. Notably, for generation under pure DP with mild assumptions, the upper bound exp(−min{1, ε} · Ω(n)) matches the lower bound up to constants, establishing an optimal rate. Our results show that the cost of privacy in language learning is surprisingly mild: absent entirely under approximate DP, and exactly a min{1, ε} factor in the exponent under pure DP.