Disciplined Diffusion: Text-to-Image Diffusion Model against NSFW Generation

arXiv cs.CV / 5/5/2026


Key Points

  • Text-to-image diffusion models can generate NSFW or harmful imagery from malicious prompts, and common safety approaches use binary allow/block filtering that is both vulnerable to evasion and prone to false alarms.
  • The proposed Disciplined Diffusion (DDiffusion) aims to improve robustness by detecting implicit harmful semantics within prompt embeddings rather than relying on brittle keyword matching or pairwise similarity.
  • DDiffusion introduces a semantic retrieval step that evaluates prompts against concept distributions, and a localization/editing method that targets and modifies only the harmful regions during image generation.
  • Instead of uniformly blocking outputs, DDiffusion returns locally sanitized images, aiming to suppress malicious content while maintaining output quality for benign prompts and reducing reliance on binary signals that adversarial probing can exploit.
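To make the retrieval idea concrete, here is a minimal sketch of scoring a prompt embedding against a fitted *distribution* of harmful-concept embeddings (via Mahalanobis distance) instead of a single pairwise cosine similarity. This is an illustrative assumption about how concept-distribution matching could work, not the paper's actual algorithm; all function names and data are hypothetical.

```python
import numpy as np

def fit_concept_distribution(embeddings: np.ndarray):
    """Fit a Gaussian (mean, regularized inverse covariance) to one concept's embeddings."""
    mean = embeddings.mean(axis=0)
    cov = np.cov(embeddings, rowvar=False) + 1e-6 * np.eye(embeddings.shape[1])
    return mean, np.linalg.inv(cov)

def mahalanobis_score(prompt_emb: np.ndarray, mean: np.ndarray, cov_inv: np.ndarray) -> float:
    """Distance of a prompt embedding from the concept distribution (lower = closer)."""
    d = prompt_emb - mean
    return float(np.sqrt(d @ cov_inv @ d))

# Toy data: 128-d embeddings standing in for a harmful-concept cluster.
rng = np.random.default_rng(0)
concept_embs = rng.normal(loc=1.0, scale=0.1, size=(200, 128))
mean, cov_inv = fit_concept_distribution(concept_embs)

in_dist = mahalanobis_score(rng.normal(1.0, 0.1, 128), mean, cov_inv)
out_dist = mahalanobis_score(rng.normal(0.0, 0.1, 128), mean, cov_inv)
assert in_dist < out_dist  # a prompt near the concept scores as closer
```

Scoring against a distribution rather than a single reference embedding is what makes this style of check harder to evade with keyword substitutions: a paraphrased prompt that still lands inside the concept's embedding cluster is still flagged.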

Abstract

Text-to-image (T2I) diffusion models can generate high-quality images from text prompts, but they pose safety concerns because they can produce offensive or disturbing imagery when given harmful inputs. Existing safety filters typically rely on text-based classifiers or image-based checkers that completely block the output upon detecting a threat, issuing an explicit allow/block feedback signal to the user. This binary strategy leaves models vulnerable to adversarial attacks that alter keywords to bypass detection, and it causes high false-alarm rates that degrade the experience for benign users. To address such vulnerabilities, we propose Disciplined Diffusion (DDiffusion), a robust text-to-image diffusion model that counters Not Safe For Work (NSFW) generation by uncovering implicit malicious semantics in prompt embeddings. DDiffusion leverages a semantic retrieval mechanism to evaluate prompts against concept distributions rather than relying on brittle pairwise similarity. Furthermore, it employs a localization method during the diffusion process to selectively edit only the harmful regions of the generated image. By returning locally sanitized images instead of applying uniform blocking, DDiffusion suppresses malicious content while preserving generation fidelity for benign prompts and avoiding the binary allow-deny signal on which existing probing attacks rely.
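The "edit only the harmful regions" step can be pictured as a mask-guided blend applied during generation: where a localization mask marks a region as harmful, a sanitized prediction is substituted; everywhere else the original prediction passes through unchanged. The sketch below is a hypothetical illustration of that blending, not the paper's implementation; the toy latents and mask are made up.

```python
import numpy as np

def blend_step(x_orig: np.ndarray, x_safe: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep x_orig where mask == 0; substitute x_safe where mask == 1."""
    return mask * x_safe + (1.0 - mask) * x_orig

# Toy 8x8 "latents": one from the original prompt, one from a sanitized variant.
x_orig = np.full((8, 8), 0.5)
x_safe = np.zeros((8, 8))

# Hypothetical localization output: a small region flagged as harmful.
mask = np.zeros((8, 8))
mask[2:5, 2:5] = 1.0

x_out = blend_step(x_orig, x_safe, mask)
assert np.all(x_out[2:5, 2:5] == 0.0)  # flagged region replaced
assert np.all(x_out[0, :] == 0.5)      # benign regions preserved exactly
```

Because only the masked region changes, a benign prompt (empty mask) yields exactly the original output, which is how this kind of local editing avoids the quality loss and false alarms of uniform blocking.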