GroundingAnomaly: Spatially-Grounded Diffusion for Few-Shot Anomaly Synthesis

arXiv cs.CV / 4/10/2026


Key Points

  • The paper introduces GroundingAnomaly, a few-shot anomaly synthesis framework aimed at improving visual anomaly inspection in industrial quality control where anomalous samples are scarce.
  • It adds a Spatial Conditioning Module that uses per-pixel semantic maps to provide precise spatial control over where synthetic anomalies appear.
  • It proposes a Gated Self-Attention Module that injects conditioning tokens into a frozen U-Net via gated attention layers to maintain pretrained priors while enabling stable few-shot adaptation.
  • Experiments on MVTec AD and VisA show that GroundingAnomaly produces high-quality anomaly images and delivers state-of-the-art results on downstream anomaly detection, segmentation, and instance-level detection tasks.
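The Spatial Conditioning Module's use of per-pixel semantic maps can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the paper's implementation: it assumes the semantic map is a per-pixel class distribution that a learned projection (`proj`, an assumed name) lifts into the U-Net feature space, then adds as a residual so anomalies are steered to the specified locations.

```python
import numpy as np

def spatial_conditioning(feat, sem_map, proj):
    """Hypothetical sketch of per-pixel spatial conditioning.

    feat:    (C, H, W) intermediate U-Net feature map
    sem_map: (K, H, W) per-pixel semantic map (one-hot or soft class scores)
    proj:    (C, K) learned projection from semantic classes to feature space

    Each pixel's semantic vector is projected into the feature space and
    added as a residual, giving pixel-precise control over where the
    synthesized anomaly appears.
    """
    cond = np.einsum('ck,khw->chw', proj, sem_map)  # (C, H, W) conditioning signal
    return feat + cond
```

Because the conditioning is applied per pixel rather than per image, zeroing the semantic map outside the target region leaves the rest of the features untouched.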

Abstract

The performance of visual anomaly inspection in industrial quality control is often constrained by the scarcity of real anomalous samples. Consequently, anomaly synthesis techniques have been developed to enlarge training sets and enhance downstream inspection. However, existing methods either suffer from poor integration of anomalies into the scene when relying on inpainting, or fail to provide accurate pixel-level anomaly masks. To address these limitations, we propose GroundingAnomaly, a novel few-shot anomaly image generation framework. Our framework introduces a Spatial Conditioning Module that leverages per-pixel semantic maps to enable precise spatial control over the synthesized anomalies. Furthermore, a Gated Self-Attention Module is designed to inject conditioning tokens into a frozen U-Net via gated attention layers, preserving the pretrained priors while enabling stable few-shot adaptation. Extensive evaluations on the MVTec AD and VisA datasets demonstrate that GroundingAnomaly generates high-quality anomalies and achieves state-of-the-art performance across multiple downstream tasks, including anomaly detection, segmentation, and instance-level detection.
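The gated injection of conditioning tokens into a frozen backbone can be sketched as follows. This is a speculative illustration of the general gated self-attention pattern, not the paper's code: the frozen visual tokens and the conditioning tokens are concatenated, self-attention runs over the joint sequence, and a `tanh`-gated residual (gate initialized at zero, an assumed design) is added back to the visual tokens only, so at initialization the frozen U-Net behaves exactly as pretrained.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_self_attention(visual, cond, gate_logit=0.0):
    """Hypothetical sketch of gated self-attention injection.

    visual:     (N, d) feature tokens from the frozen U-Net
    cond:       (M, d) conditioning tokens (anomaly location/appearance)
    gate_logit: learnable scalar; tanh(0) = 0, so the pretrained prior is
                unchanged at initialization and adaptation stays stable.
    """
    x = np.concatenate([visual, cond], axis=0)         # joint token sequence
    attn = softmax(x @ x.T / np.sqrt(x.shape[1]))      # self-attention weights
    out = attn @ x                                     # attended features
    # keep only the visual tokens and add the gated residual
    return visual + np.tanh(gate_logit) * out[: len(visual)]
```

With the gate at zero the module is an identity on the visual tokens, which is what makes inserting it into a frozen network safe; training then gradually opens the gate to let the conditioning influence generation.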