Reading Between the Pixels: An Inscriptive Jailbreak Attack on Text-to-Image Models

arXiv cs.CV / 4/8/2026


Key Points

  • The paper identifies a new “inscriptive jailbreak” threat for text-to-image (T2I) models that can force the generation of images containing harmful, legible paragraph-length text (e.g., fraudulent documents) embedded in otherwise benign scenes.
  • It argues this differs from earlier “depictive” jailbreaks because the attack weaponizes character-level text-rendering fidelity, making prior coarse visual-manipulation defenses less effective.
  • The authors propose Etch, a black-box attack framework that splits an adversarial prompt into three orthogonal layers—semantic camouflage, visual-spatial anchoring, and typographic encoding—and iteratively refines them via a zero-order optimization loop.
  • A vision-language model is used to critique generated images, localize which layer(s) fail, and recommend targeted prompt revisions, enabling higher character-level control.
  • Experiments across 7 T2I models on two benchmarks report an average attack success rate of 65.57% with a peak of 91.00%, highlighting a typography-aware defense gap in current multimodal safety alignments.
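The layered-prompt refinement loop summarized above can be sketched in outline. This is a minimal illustration, not the authors' implementation: the class and function names (`LayeredPrompt`, `generate_image`, `critique`, `etch_loop`) are hypothetical, and the T2I call and VLM critic are replaced by stubs so the control flow is runnable.

```python
# Hypothetical sketch of an Etch-style zero-order refinement loop.
# Only the three layer names come from the paper; everything else is illustrative.

from dataclasses import dataclass

@dataclass
class LayeredPrompt:
    semantic_camouflage: str      # benign scene framing
    visual_spatial_anchor: str    # where the text should appear in the image
    typographic_encoding: str     # how the payload text is encoded/rendered

    def compose(self) -> str:
        # The three sub-prompts are optimized separately but submitted jointly.
        return " ".join([self.semantic_camouflage,
                         self.visual_spatial_anchor,
                         self.typographic_encoding])

def generate_image(prompt: str) -> str:
    # Stand-in for a black-box T2I API call; returns a mock "image".
    return f"image({prompt})"

def critique(image: str, step: int):
    # Stand-in for the VLM critic: returns (success, failing_layer, revision).
    # A real critic would inspect the image, localize which layer failed,
    # and prescribe a targeted revision to that sub-prompt only.
    if step < 2:
        return False, "typographic_encoding", f"render the text more legibly (rev {step})"
    return True, None, None

def etch_loop(prompt: LayeredPrompt, max_iters: int = 5):
    """Iterate: generate, critique, revise only the failing layer."""
    for step in range(max_iters):
        image = generate_image(prompt.compose())
        success, layer, revision = critique(image, step)
        if success:
            return True, step
        # Zero-order step: no gradients, just a prompt edit on one layer.
        setattr(prompt, layer, revision)
    return False, max_iters

ok, steps = etch_loop(LayeredPrompt("a cozy office scene",
                                    "a framed letter on the desk",
                                    "letter text: ..."))
print(ok, steps)  # → True 2
```

The point of the decomposition is visible in `setattr`: each critique touches only one sub-prompt, so the search stays in three small spaces rather than the full joint prompt space.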

Abstract

Modern text-to-image (T2I) models can now render legible, paragraph-length text, enabling a fundamentally new class of misuse. We identify and formalize the inscriptive jailbreak, where an adversary coerces a T2I system into generating images containing harmful textual payloads (e.g., fraudulent documents) embedded within visually benign scenes. Unlike traditional depictive jailbreaks that elicit visually objectionable imagery, inscriptive attacks weaponize the text-rendering capability itself. Because existing jailbreak techniques are designed for coarse visual manipulation, they struggle to bypass multi-stage safety filters while maintaining character-level fidelity. To expose this vulnerability, we propose Etch, a black-box attack framework that decomposes the adversarial prompt into three functionally orthogonal layers: semantic camouflage, visual-spatial anchoring, and typographic encoding. This decomposition reduces joint optimization over the full prompt space to tractable sub-problems, which are iteratively refined through a zero-order loop. In this process, a vision-language model critiques each generated image, localizes failures to specific layers, and prescribes targeted revisions. Extensive evaluations across 7 models on 2 benchmarks demonstrate that Etch achieves an average attack success rate of 65.57% (peaking at 91.00%), significantly outperforming existing baselines. Our results reveal a critical blind spot in current T2I safety alignments and underscore the urgent need for typography-aware multimodal defense mechanisms.