DiffAnon: Diffusion-based Prosody Control for Voice Anonymization

arXiv cs.LG / 4/30/2026


Key Points

  • The paper addresses a core challenge in voice anonymization: prosody carries meaning and emotion but is also tied to speaker identity.
  • It introduces DiffAnon, a diffusion-based anonymization framework that uses classifier-free guidance to enable continuous, explicit control over prosody preservation during inference.
  • DiffAnon refines acoustic details over semantic embeddings of an RVQ codec, allowing smooth interpolation between stronger anonymization and higher prosodic fidelity within one model.
  • Experiments show a structured utility–privacy trade-off: the model achieves strong utility while keeping privacy competitive across multiple controllable operating points.
  • The authors claim DiffAnon is the first voice anonymization approach to offer structured, interpolatable prosody control at inference time.
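
The continuous control described in the key points rests on classifier-free guidance. The summary does not give the paper's exact guidance formula or conditioning signal, so the following is only a minimal sketch of one commonly used CFG form, `uncond + scale * (cond - uncond)`, applied to toy numbers. The function name `cfg_combine` and the idea that the conditional branch carries prosody information are assumptions made for illustration, not details taken from the paper.

```python
from typing import List


def cfg_combine(uncond: List[float], cond: List[float], scale: float) -> List[float]:
    """Blend two model predictions with a guidance scale.

    scale = 0 returns the unconditional prediction, scale = 1 returns the
    conditional one, and values in between interpolate linearly.
    """
    if len(uncond) != len(cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Toy values standing in for a denoiser's outputs at a single sampling step.
# (Hypothetical: a real model would produce these from noisy latents.)
uncond_pred = [0.0, 1.0, 2.0]   # prediction without the conditioning signal
cond_pred = [1.0, 1.0, 0.0]     # prediction with the conditioning signal

for scale in (0.0, 0.5, 1.0):
    print(scale, cfg_combine(uncond_pred, cond_pred, scale))
```

Under these assumptions, a single scalar chosen at inference time moves the output between the two predictions, which is the kind of knob the key points describe as interpolating between stronger anonymization and higher prosodic fidelity. Scales above 1 would extrapolate past the conditional prediction; whether DiffAnon uses that range is not stated here.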

Abstract

To preserve or not to preserve prosody is a central question in voice anonymization. Prosody conveys meaning and affect, yet is tightly coupled with speaker identity. Existing methods either discard prosody for privacy or lack a principled mechanism to control the utility-privacy trade-off, operating at fixed design points. We propose DiffAnon, a diffusion-based anonymization method with classifier-free guidance (CFG) that provides explicit, continuous inference-time control over prosody preservation. DiffAnon refines acoustic detail over semantic embeddings of an RVQ codec, enabling smooth interpolation between anonymization strength and prosodic fidelity within a single model. To the best of our knowledge, it is the first voice anonymization framework to provide structured, interpolatable inference-time prosody control. Experiments demonstrate structured trade-off behavior, achieving strong utility while maintaining competitive privacy across controllable operating points.
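
The abstract refers to an RVQ (residual vector quantization) codec without describing it. As background, here is a minimal sketch of generic RVQ on a toy 2-D vector: each stage quantizes whatever residual the previous stage left behind, so early stages give a coarse approximation and later stages add detail. The codebooks, the function names (`nearest`, `rvq_encode`, `rvq_decode`), and the input are all hypothetical. How DiffAnon's codec assigns semantic versus acoustic information to its stages is not specified in this summary.

```python
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]
Codebook = Sequence[Vector]


def nearest(codebook: Codebook, x: Vector) -> int:
    """Index of the codebook entry closest to x (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], x)),
    )


def rvq_encode(x: Vector, codebooks: Sequence[Codebook]) -> Tuple[List[int], List[float]]:
    """Quantize x stage by stage; return chosen indices and the final residual."""
    residual = list(x)
    indices: List[int] = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        # The next stage only sees what this stage failed to capture.
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices, residual


def rvq_decode(indices: Sequence[int], codebooks: Sequence[Codebook],
               stages: Optional[int] = None) -> List[float]:
    """Sum the selected entries from the first `stages` codebooks (all by default)."""
    use = len(indices) if stages is None else stages
    out = [0.0] * len(codebooks[0][0])
    for codebook, i in list(zip(codebooks, indices))[:use]:
        out = [o + c for o, c in zip(out, codebook[i])]
    return out


# Hypothetical two-stage codebooks: a coarse stage, then a fine correction stage.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],
    [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]],
]
x = [1.25, 0.75]

indices, residual = rvq_encode(x, codebooks)
print(indices)                                   # chosen entry per stage
print(rvq_decode(indices, codebooks, stages=1))  # coarse reconstruction
print(rvq_decode(indices, codebooks))            # reconstruction with all stages
```

In this toy setup the first stage alone reconstructs `[1.0, 1.0]` and the second stage supplies the remaining detail. A model that "refines acoustic detail over semantic embeddings" could plausibly operate on such a coarse-to-fine representation, but that reading is an assumption here.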