Differentially-Private Text Rewriting reshapes Linguistic Style

arXiv cs.CL / 4/30/2026

Key Points

The paper argues that text-level differential privacy has evolved from isolated word substitutions to contiguous sentence rewriting by using the generative capabilities of language models.
It finds that enforcing differential privacy affects more than vocabulary, causing a systematic change in the text’s communicative and stylistic signature.
Specifically, privacy-constrained rewriting substantially reduces interactive markers, contextual references, and complex subordinate structures that contribute to natural, human-like discourse.
Across different privacy budgets, comparisons between autoregressive paraphrasing and bidirectional substitution show both approaches converge toward a “non-involved, non-persuasive” register.
The authors conclude that while semantic meaning can be largely preserved, the structural homogenization of stylistic cues can erase aspects of linguistic identity in human-authored text.
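
To make the idea of stylistic attrition concrete, here is a minimal sketch of how per-category marker rates could be compared before and after rewriting. The marker lists, the function names (`style_profile`, `attrition`), and the per-100-token normalization are assumptions chosen for illustration; the summary above does not describe the paper's actual feature set or profiling method.

```python
import re

# Hypothetical, hand-picked marker lists for illustration only.
# The paper's real feature inventory is not given in this summary.
INTERACTIVE = {"i", "you", "we", "me", "my", "your", "us", "our"}
CONTEXTUAL = {"this", "that", "these", "those", "here", "there", "now", "then"}
SUBORDINATORS = {"because", "although", "since", "while", "whereas", "unless", "if", "though"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def style_profile(text):
    """Return the rate of each marker category per 100 tokens."""
    tokens = tokenize(text)
    n = len(tokens) or 1  # avoid division by zero on empty input

    def rate(vocab):
        return 100.0 * sum(t in vocab for t in tokens) / n

    return {
        "interactive": rate(INTERACTIVE),
        "contextual": rate(CONTEXTUAL),
        "subordination": rate(SUBORDINATORS),
    }


def attrition(original, rewritten):
    """Per-category drop in marker rate (positive means markers were lost)."""
    before, after = style_profile(original), style_profile(rewritten)
    return {k: before[k] - after[k] for k in before}


original = "I think you should stay here because this matters to us."
rewritten = "The person should remain as the issue is important."
print(attrition(original, rewritten))
```

In this toy pair, the rewritten sentence keeps the gist but contains none of the listed markers, so every category shows a positive drop. A word-list count like this is far cruder than a full multidimensional register analysis and is meant only to show the shape of the comparison.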

Abstract

Differential Privacy (DP) for text has matured from disjointed word-level substitutions to contiguous sentence-level rewriting by leveraging the generative capacity of language models. While this form of text privatization is best suited to balancing formal privacy guarantees with grammatical coherence, its impact on the register identity of text remains largely unexplored. By conducting a multidimensional stylistic profiling of differentially-private rewriting, we demonstrate that the cost of privacy extends far beyond lexical variation. Specifically, we find that rewriting under privacy constraints induces a systematic functional mutation of the text's communicative signature. This shift is characterized by the severe attrition of interactive markers, contextual references, and complex subordination. By comparing autoregressive paraphrasing against bidirectional substitution across a spectrum of privacy budgets, we observe that both architectures force convergence toward a non-involved and non-persuasive register. This register-blind sanitization effectively preserves semantic content but structurally homogenizes the nuanced stylistic markers that define human-authored discourse.
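
For readers unfamiliar with how a privacy budget shapes word choice, the following is a minimal sketch of the textbook exponential mechanism applied to picking a replacement word. It is a generic illustration, not the mechanism used by either architecture in the paper; the candidate words, utility scores, and function names are hypothetical.

```python
import math
import random


def selection_probabilities(utilities, epsilon, sensitivity=1.0):
    """Exponential-mechanism probabilities: p_i proportional to exp(eps * u_i / (2 * sensitivity))."""
    scores = [epsilon * u / (2.0 * sensitivity) for u in utilities]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def private_choice(candidates, utilities, epsilon, sensitivity=1.0, rng=random):
    """Sample one candidate according to the exponential mechanism."""
    probs = selection_probabilities(utilities, epsilon, sensitivity)
    return rng.choices(candidates, weights=probs, k=1)[0]


# Hypothetical candidates and utility scores in [0, 1] (assumed, e.g. a similarity score).
candidates = ["honestly", "frankly", "generally", "typically"]
utilities = [1.0, 0.9, 0.3, 0.2]

for eps in (0.1, 1.0, 10.0, 50.0):
    probs = selection_probabilities(utilities, eps)
    print(eps, [round(p, 3) for p in probs])
```

With a small budget the probabilities are close to uniform, so the replacement is nearly random with respect to the original word; with a large budget the highest-utility candidate dominates. This is one intuition for why tight budgets can push output away from the author's distinctive choices.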