Memorization In Stable Diffusion Is Unexpectedly Driven by CLIP Embeddings

arXiv cs.CV / 5/6/2026


Key Points

  • The paper analyzes why text-to-image diffusion models like Stable Diffusion can unexpectedly memorize training content, focusing specifically on how CLIP text embeddings drive that behavior.
  • It finds that in memorization-heavy cases the prompt embeddings contribute little, while the <pad> embeddings strongly increase memorization because they structurally duplicate the <endoftext> embedding (a duplication demonstrated in the sketch after this list).
  • The study argues that this duplication amplifies the influence of the <endoftext> embedding (which is explicitly optimized during CLIP training), leading the model to over-rely on it and thereby memorize.
  • To mitigate memorization at inference time, the authors propose two simple embedding masking/replacement strategies that suppress memorization without quality degradation and require no prior detection of memorization.
  • The results highlight a safety- and interpretability-relevant mechanism: token-embedding side effects in CLIP can translate into memorization risk in diffusion generation.
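
The structural duplication is easy to verify: in the CLIP tokenizer that Stable Diffusion v1.x uses, the pad token defaults to <|endoftext|>, so every padding position repeats the <eot> token id. A minimal check, assuming the openai/clip-vit-large-patch14 tokenizer (the text tower SD v1.x ships with):

```python
from transformers import CLIPTokenizer

# Tokenizer used by Stable Diffusion v1.x (assumed checkpoint).
tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

ids = tok(
    "a photo of an astronaut",
    padding="max_length",
    max_length=tok.model_max_length,  # 77 for CLIP
    truncation=True,
).input_ids

print(tok.pad_token)                # '<|endoftext|>' -- padding reuses <eot>
print(ids[:8])                      # [<sot>, prompt token ids..., <eot>, <eot>, ...]
print(ids.count(tok.eos_token_id))  # every pad slot carries the same <eot> id
```

Because the pad slots carry the <eot> id, their embeddings are literal copies of $\mathbf{v}^{\mathbf{eot}}$, which is how the duplication amplifies that single embedding's influence.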

Abstract

Understanding how textual embeddings contribute to memorization in text-to-image diffusion models is crucial for both interpretability and safety. This paper investigates an unexpected behavior of CLIP embeddings in Stable Diffusion, revealing that the model disproportionately relies on specific embeddings. We categorize input tokens as <sot>, <pr> (prompt), <eot>, and <pad>, with corresponding embeddings $\mathbf{v}^{\mathbf{sot}}$, $\mathbf{v}^{\mathbf{pr}}$, $\mathbf{v}^{\mathbf{eot}}$, and $\mathbf{v}^{\mathbf{pad}}$. We discover that the $\mathbf{v}^{\mathbf{pr}}$ contribute minimally to generation in memorized cases. In contrast, the $\mathbf{v}^{\mathbf{pad}}$ strongly affect memorization due to their structural duplication of $\mathbf{v}^{\mathbf{eot}}$, the only embedding explicitly optimized during CLIP training. This duplication unintentionally amplifies the influence of $\mathbf{v}^{\mathbf{eot}}$, causing the model to over-rely on it, thereby driving memorization. Based on these observations, we propose two simple yet effective inference-time mitigation strategies: (1) replacing the tokenizer's default <pad> token from <eot> to the ! token before embedding, and masking $\mathbf{v}^{\mathbf{eot}}$; (2) partial masking of $\mathbf{v}^{\mathbf{pad}}$. Both suppress memorization without degrading quality, and are readily deployable without prior detection.
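
As a concrete reading of strategy (2), one can compute the prompt embeddings manually, zero out part of the pad positions, and feed the result to the pipeline via prompt_embeds. The sketch below is a hypothetical implementation, not the paper's code: the checkpoint (CompVis/stable-diffusion-v1-4), the 50% masked fraction, and zeroing as the masking operation are all assumptions.

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumed checkpoint; any SD v1.x pipeline with a CLIP text encoder should work.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of an astronaut"
tokens = pipe.tokenizer(
    prompt,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    embeds = pipe.text_encoder(tokens.input_ids.to(pipe.device))[0]  # (1, 77, 768)

# The first <eot> id marks the end of the prompt; every position after it is
# padding, which by default duplicates the <eot> token.
eot_pos = (tokens.input_ids[0] == pipe.tokenizer.eos_token_id).nonzero()[0].item()

# Partial masking of the pad embeddings: zero out an assumed 50% of them.
n_pad = embeds.shape[1] - (eot_pos + 1)
n_mask = n_pad // 2
embeds[:, eot_pos + 1 : eot_pos + 1 + n_mask] = 0

image = pipe(prompt_embeds=embeds).images[0]
image.save("masked_pads.png")
```

Strategy (1) would be analogous: set the tokenizer's pad token to "!" before encoding so the pad positions no longer copy $\mathbf{v}^{\mathbf{eot}}$, then zero the embedding at the single <eot> position instead.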