Low-Effort Jailbreak Attacks Against Text-to-Image Safety Filters
arXiv cs.CV / 2026/4/3
Key Points
- The paper demonstrates that widely used text-to-image generative systems can be bypassed using low-effort, prompt-only “jailbreak” attacks that do not require model access, optimization, or adversarial training.
- It proposes a taxonomy of visual jailbreak techniques (e.g., artistic reframing, material substitution, pseudo-educational framing, lifestyle aesthetic camouflage, and ambiguous action substitution) that hide harmful intent within seemingly benign language.
- Evaluations across multiple state-of-the-art text-to-image models show that simple linguistic modifications can reliably evade existing safety filters, with reported attack success rates up to 74.47%.
- The findings point to a fundamental mismatch between surface-level prompt moderation and the deeper semantic understanding required to detect adversarial intent in generative image pipelines.
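
To make the surface-vs-semantics mismatch concrete, here is a minimal, purely illustrative sketch (not the paper's actual filters): a naive keyword-blocklist moderator rejects prompts containing banned terms, so a semantically equivalent rewording in the style of "artistic reframing" passes unchecked. The blocklist and prompts are hypothetical examples.

```python
# Hypothetical keyword-blocklist filter, for illustration only.
BLOCKLIST = {"weapon", "gun", "explosive"}

def surface_filter(prompt: str) -> bool:
    """Return True if the prompt is allowed (contains no blocklisted token)."""
    tokens = prompt.lower().split()
    return not any(term in tokens for term in BLOCKLIST)

direct = "photo of a gun on a table"
reframed = "oil painting of a metallic hand-held device in a noir still life"

print(surface_filter(direct))    # False: blocked, contains "gun"
print(surface_filter(reframed))  # True: allowed, intent hidden by artistic reframing
```

A filter like this operates only on surface tokens; detecting the reframed prompt would require modeling what the described scene depicts, which is exactly the semantic gap the paper's low-effort attacks exploit.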
