Jailbreaking Vision-Language Models Through the Visual Modality

arXiv cs.AI / 5/4/2026


Key Points

  • The paper argues that the visual modality in vision-language models (VLMs) is an underexplored pathway for bypassing safety alignment.
  • It presents four visual jailbreak strategies: encoding harmful instructions as visual symbol sequences with a decoding legend; substituting a benign object for a harmful one (e.g., “bomb” → “banana”) while still prompting for the harmful actions; altering harmful text inside images while the surrounding visual context preserves the original meaning; and posing visual analogy puzzles whose solution requires inferring a prohibited concept (see the sketch of the first strategy after this list).
  • Tests on six frontier VLMs show that these visual attacks can successfully bypass safety alignment, revealing a “cross-modality alignment gap” where text-only safety training does not generalize to harmful intent conveyed visually.
  • The authors report a striking example: a visual cipher achieves a 40.9% attack success rate on Claude-Haiku-4.5, versus 10.7% for an equivalent text cipher, and they provide preliminary interpretability and mitigation directions.
  • The work concludes that robust VLM alignment should treat vision as a first-class target during safety post-training rather than relying on text-based safety measures alone.
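
To make the first strategy concrete, here is a minimal, benign sketch of a visual symbol cipher with a decoding legend, written in Python with Pillow. The symbol set, layout, and helper names (LEGEND, encode, render_cipher_image) are illustrative assumptions, not the paper's exact construction.

```python
# Sketch of attack strategy (1): render a symbol-encoded message plus
# its decoding legend into a single image, so that decoding happens in
# the model's visual channel rather than in plain text.
from PIL import Image, ImageDraw

# Hypothetical letter -> symbol mapping (the legend shown in the image).
LEGEND = {"a": "@", "e": "#", "i": "!", "o": "0", "u": "^"}

def encode(text: str) -> str:
    """Replace legend letters with symbols; other characters pass through."""
    return "".join(LEGEND.get(ch, ch) for ch in text.lower())

def render_cipher_image(message: str, path: str = "cipher.png") -> None:
    """Draw the decoding legend and the encoded message into one image."""
    lines = ["Legend:"]
    lines += [f"{sym} = {ch}" for ch, sym in LEGEND.items()]
    lines += ["", "Message:", encode(message)]
    img = Image.new("RGB", (480, 40 + 24 * len(lines)), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((20, 20 + 24 * i), line, fill="black")
    img.save(path)

# Benign demonstration only; the attack's point is that the model, not
# the text channel, must perform the decoding.
render_cipher_image("describe a quiet mountain lake")
```

Pairing such an image with an innocuous text request to decode and follow the message moves the instruction entirely into the visual channel, which is exactly the cross-modality gap the paper probes.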

Abstract

The visual modality of vision-language models (VLMs) is an underexplored attack surface for bypassing safety alignment. We introduce four jailbreak attacks exploiting the vision component: (1) encoding harmful instructions as visual symbol sequences with a decoding legend, (2) replacing harmful objects with benign substitutes (e.g., bomb → banana) and then prompting for harmful actions using the substitute term, (3) replacing harmful text in images (e.g., on book covers) with benign words while the visual context preserves the original meaning, and (4) visual analogy puzzles whose solution requires inferring a prohibited concept. Across six frontier VLMs, our visual attacks bypass safety alignment and expose a cross-modality alignment gap: text-based safety training does not automatically generalize to harmful intent conveyed visually. For example, our visual cipher achieves a 40.9% attack success rate on Claude-Haiku-4.5 versus 10.7% for an equivalent textual cipher. To shed light on the attack mechanism, we present preliminary interpretability and mitigation results. These findings highlight that robust VLM alignment requires treating vision as a first-class target for safety post-training.
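
For context, the headline metric is attack success rate (ASR): the fraction of harmful attempts that a judge labels as successful jailbreaks. A minimal sketch, assuming per-attempt boolean judge verdicts (the helper name and sample size below are illustrative, not from the paper):

```python
def attack_success_rate(verdicts: list[bool]) -> float:
    """ASR = fraction of attempts judged as successful jailbreaks."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0

# Illustrative arithmetic only: 9 successes out of 22 attempts gives
# 9/22 ~= 0.409, one ratio consistent with the reported 40.9% figure.
assert round(attack_success_rate([True] * 9 + [False] * 13), 3) == 0.409
```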