FlowGuard: Towards Lightweight In-Generation Safety Detection for Diffusion Models via Linear Latent Decoding

arXiv cs.CV / 4/10/2026


Key Points

  • FlowGuard is proposed as a lightweight, cross-model framework for detecting NSFW/unsafe content during the in-generation denoising process of diffusion models, rather than only before or after image creation.
  • It addresses a core challenge of latent diffusion, namely that early denoising steps are dominated by noise, using a novel linear latent-decoding approximation to recover safety-relevant signals efficiently.
  • The method incorporates curriculum learning to stabilize training and enable effective safety detection across intermediate steps.
  • Experiments on a cross-model benchmark covering nine diffusion backbones show improved in-generation NSFW detection, with F1-score gains of over 30%, in both in-distribution and out-of-distribution settings.
  • Reported efficiency improvements are substantial: peak GPU memory is cut by over 97% and projection time drops from 8.1s to 0.2s versus standard VAE decoding, and detecting unsafe content early lets generation stop before the remaining diffusion steps are spent.
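The cheap projection at the heart of the method can be illustrated with a toy sketch: instead of running the full VAE decoder, each 4-channel latent "pixel" is mapped to RGB by a single small matrix. The matrix values, shapes, and function names below are illustrative assumptions for a Stable-Diffusion-style 4-channel latent, not FlowGuard's actual learned weights.

```python
import numpy as np

# Hypothetical 3x4 linear decoder: one matrix multiply per latent pixel
# replaces the full VAE decode. Values are illustrative, not the paper's.
W = np.array([
    [0.298,  0.187, -0.158, -0.184],
    [0.207,  0.286,  0.189, -0.271],
    [0.208,  0.173,  0.264, -0.473],
])

def linear_decode(latent: np.ndarray) -> np.ndarray:
    """Project a (4, H, W) latent to a coarse (3, H, W) RGB preview.

    A channel-axis matrix multiply costs O(H*W) with a tiny constant,
    which is why it is far cheaper than running a VAE decoder.
    """
    c, _, _ = latent.shape
    assert c == 4, "expects a 4-channel latent"
    rgb = np.tensordot(W, latent, axes=([1], [0]))  # -> (3, H, W)
    return np.clip(rgb, -1.0, 1.0)

preview = linear_decode(np.random.randn(4, 64, 64).astype(np.float32))
print(preview.shape)  # (3, 64, 64)
```

The preview is too coarse for display, but it is cheap enough to feed a safety classifier at every denoising step, which is what makes in-generation detection practical.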

Abstract

Diffusion-based image generation models have advanced rapidly but pose a safety risk due to their potential to generate Not-Safe-For-Work (NSFW) content. Existing NSFW detection methods mainly operate either before or after image generation. Pre-generation methods rely on text prompts and struggle with the gap between prompt safety and image safety. Post-generation methods apply classifiers to final outputs, but they are poorly suited to intermediate noisy images. To address this, we introduce FlowGuard, a cross-model in-generation detection framework that inspects intermediate denoising steps. This is particularly challenging in latent diffusion, where early-stage noise obscures visual signals. FlowGuard employs a novel linear approximation for latent decoding and leverages a curriculum learning approach to stabilize training. By detecting unsafe content early, FlowGuard reduces unnecessary diffusion steps to cut computational costs. Our cross-model benchmark spanning nine diffusion-based backbones shows the effectiveness of FlowGuard for in-generation NSFW detection in both in-distribution and out-of-distribution settings, outperforming existing methods by over 30% in F1 score while delivering substantial efficiency gains: peak GPU memory demand falls by over 97% and projection time drops from 8.1 seconds to 0.2 seconds compared to standard VAE decoding.
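The early-exit behavior described above can be sketched as a guarded denoising loop. Everything here is a hedged stand-in, not the paper's implementation: `denoise_step` and `classify` are placeholder callables (the classifier would run on a cheaply decoded latent), and the threshold is an assumed hyperparameter.

```python
def generate_with_guard(denoise_step, classify, latent,
                        num_steps=50, threshold=0.9):
    """Run a denoising loop that screens every intermediate latent.

    classify(latent, t) returns an unsafe-content probability; as soon
    as it exceeds the threshold, generation aborts, saving all the
    remaining diffusion steps. Returns (final_latent_or_None, steps_run).
    """
    for t in range(num_steps):
        latent = denoise_step(latent, t)
        if classify(latent, t) > threshold:
            return None, t + 1  # flagged unsafe after t+1 steps
    return latent, num_steps

# Toy demo: a no-op denoiser and a classifier that fires at step 12,
# so generation stops after 13 of the 50 scheduled steps.
result, steps = generate_with_guard(
    lambda x, t: x,
    lambda x, t: 1.0 if t >= 12 else 0.0,
    latent=[0.0],
)
print(result, steps)  # None 13
```

This is where the compute savings beyond the cheap projection come from: a flagged sample never pays for the rest of its denoising schedule.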