Memorization In Stable Diffusion Is Unexpectedly Driven by CLIP Embeddings

arXiv cs.CV / 5/6/2026


Key Points

  • The paper analyzes why text-to-image diffusion models like Stable Diffusion can unexpectedly memorize training content, focusing specifically on how CLIP text embeddings drive that behavior.
  • It finds that in memorization-heavy cases the prompt embeddings contribute little, while the <pad> embeddings strongly increase memorization because they structurally duplicate the <endoftext> embedding (a duplication demonstrated in the sketch after this list).
  • The study argues that this duplication amplifies the influence of the <endoftext> embedding (which is explicitly optimized during CLIP training), leading the model to over-rely on it and thereby memorize.
  • To mitigate memorization at inference time, the authors propose two simple embedding masking/replacement strategies that suppress memorization without quality degradation and require no prior detection of memorization.
  • The results highlight a safety- and interpretability-relevant mechanism: token-embedding side effects in CLIP can translate into memorization risk in diffusion generation.
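
The structural duplication is easy to verify: in the CLIP tokenizer that Stable Diffusion v1.x uses, the pad token defaults to <|endoftext|>, so every padding position repeats the <eot> token id. A minimal check, assuming the openai/clip-vit-large-patch14 tokenizer (the text tower SD v1.x ships with):

```python
from transformers import CLIPTokenizer

# Tokenizer used by Stable Diffusion v1.x (assumed checkpoint).
tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

ids = tok(
    "a photo of an astronaut",
    padding="max_length",
    max_length=tok.model_max_length,  # 77 for CLIP
    truncation=True,
).input_ids

print(tok.pad_token)                # '<|endoftext|>' -- padding reuses <eot>
print(ids[:8])                      # [<sot>, prompt token ids..., <eot>, <eot>, ...]
print(ids.count(tok.eos_token_id))  # every pad slot carries the same <eot> id
```

Because the pad slots carry the <eot> id, their embeddings are literal copies of $\mathbf{v}^{\mathbf{eot}}$, which is how the duplication amplifies that single embedding's influence.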

Abstract

Understanding how textual embeddings contribute to memorization in text-to-image diffusion models is crucial for both interpretability and safety. This paper investigates an unexpected behavior of CLIP embeddings in Stable Diffusion, revealing that the model disproportionately relies on specific embeddings. We categorize input tokens as <sot>, <pr> (prompt), <eot>, and <pad>, with corresponding embeddings $\mathbf{v}^{\mathbf{sot}}$, $\mathbf{v}^{\mathbf{pr}}$, $\mathbf{v}^{\mathbf{eot}}$, and $\mathbf{v}^{\mathbf{pad}}$. We discover that the $\mathbf{v}^{\mathbf{pr}}$ contribute minimally to generation in memorized cases. In contrast, the $\mathbf{v}^{\mathbf{pad}}$ strongly affect memorization due to their structural duplication of $\mathbf{v}^{\mathbf{eot}}$, the only embedding explicitly optimized during CLIP training. This duplication unintentionally amplifies the influence of $\mathbf{v}^{\mathbf{eot}}$, causing the model to over-rely on it, thereby driving memorization. Based on these observations, we propose two simple yet effective inference-time mitigation strategies: (1) replacing the tokenizer's default <pad> token from <eot> to the ! token before embedding, and masking $\mathbf{v}^{\mathbf{eot}}$; (2) partial masking of $\mathbf{v}^{\mathbf{pad}}$. Both suppress memorization without degrading quality, and are readily deployable without prior detection.
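
As a concrete reading of strategy (2), one can compute the prompt embeddings manually, zero out part of the pad positions, and feed the result to the pipeline via prompt_embeds. The sketch below is a hypothetical implementation, not the paper's code: the checkpoint (CompVis/stable-diffusion-v1-4), the 50% masked fraction, and zeroing as the masking operation are all assumptions.

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumed checkpoint; any SD v1.x pipeline with a CLIP text encoder should work.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of an astronaut"
tokens = pipe.tokenizer(
    prompt,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    embeds = pipe.text_encoder(tokens.input_ids.to(pipe.device))[0]  # (1, 77, 768)

# The first <eot> id marks the end of the prompt; every position after it is
# padding, which by default duplicates the <eot> token.
eot_pos = (tokens.input_ids[0] == pipe.tokenizer.eos_token_id).nonzero()[0].item()

# Partial masking of the pad embeddings: zero out an assumed 50% of them.
n_pad = embeds.shape[1] - (eot_pos + 1)
n_mask = n_pad // 2
embeds[:, eot_pos + 1 : eot_pos + 1 + n_mask] = 0

image = pipe(prompt_embeds=embeds).images[0]
image.save("masked_pads.png")
```

Strategy (1) would be analogous: set the tokenizer's pad token to "!" before encoding so the pad positions no longer copy $\mathbf{v}^{\mathbf{eot}}$, then zero the embedding at the single <eot> position instead.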