Jailbreaking Vision-Language Models Through the Visual Modality

arXiv cs.AI / 5/4/2026


Key Points

  • The paper argues that the visual modality in vision-language models (VLMs) is an underexplored pathway for bypassing safety alignment.
  • It presents four visual jailbreak strategies: encoding harmful instructions as visual symbol sequences with a decoding legend; substituting a benign object for a harmful one (e.g., “bomb” → “banana”) while still prompting for the harmful actions; altering harmful text inside images while the surrounding visual context preserves the original meaning; and posing visual analogy puzzles whose solution requires inferring a prohibited concept (see the sketch of the first strategy after this list).
  • Tests on six frontier VLMs show that these visual attacks can successfully bypass safety alignment, revealing a “cross-modality alignment gap” where text-only safety training does not generalize to harmful intent conveyed visually.
  • The authors report a striking example: a visual cipher achieves a 40.9% attack success rate on Claude-Haiku-4.5, versus 10.7% for an equivalent text cipher, and they provide preliminary interpretability and mitigation directions.
  • The work concludes that robust VLM alignment should treat vision as a first-class target during safety post-training rather than relying on text-based safety measures alone.
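
To make the first strategy concrete, here is a minimal, benign sketch of a visual symbol cipher with a decoding legend, written in Python with Pillow. The symbol set, layout, and helper names (LEGEND, encode, render_cipher_image) are illustrative assumptions, not the paper's exact construction.

```python
# Sketch of attack strategy (1): render a symbol-encoded message plus
# its decoding legend into a single image, so that decoding happens in
# the model's visual channel rather than in plain text.
from PIL import Image, ImageDraw

# Hypothetical letter -> symbol mapping (the legend shown in the image).
LEGEND = {"a": "@", "e": "#", "i": "!", "o": "0", "u": "^"}

def encode(text: str) -> str:
    """Replace legend letters with symbols; other characters pass through."""
    return "".join(LEGEND.get(ch, ch) for ch in text.lower())

def render_cipher_image(message: str, path: str = "cipher.png") -> None:
    """Draw the decoding legend and the encoded message into one image."""
    lines = ["Legend:"]
    lines += [f"{sym} = {ch}" for ch, sym in LEGEND.items()]
    lines += ["", "Message:", encode(message)]
    img = Image.new("RGB", (480, 40 + 24 * len(lines)), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((20, 20 + 24 * i), line, fill="black")
    img.save(path)

# Benign demonstration only; the attack's point is that the model, not
# the text channel, must perform the decoding.
render_cipher_image("describe a quiet mountain lake")
```

Pairing such an image with an innocuous text request to decode and follow the message moves the instruction entirely into the visual channel, which is exactly the cross-modality gap the paper probes.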

Abstract

The visual modality of vision-language models (VLMs) is an underexplored attack surface for bypassing safety alignment. We introduce four jailbreak attacks exploiting the vision component: (1) encoding harmful instructions as visual symbol sequences with a decoding legend, (2) replacing harmful objects with benign substitutes (e.g., bomb → banana) and then prompting for harmful actions using the substitute term, (3) replacing harmful text in images (e.g., on book covers) with benign words while the visual context preserves the original meaning, and (4) visual analogy puzzles whose solution requires inferring a prohibited concept. Across six frontier VLMs, our visual attacks bypass safety alignment and expose a cross-modality alignment gap: text-based safety training does not automatically generalize to harmful intent conveyed visually. For example, our visual cipher achieves a 40.9% attack success rate on Claude-Haiku-4.5 versus 10.7% for an equivalent textual cipher. To shed light on the attack mechanism, we present preliminary interpretability and mitigation results. These findings highlight that robust VLM alignment requires treating vision as a first-class target for safety post-training.
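
For context, the headline metric is attack success rate (ASR): the fraction of harmful attempts that a judge labels as successful jailbreaks. A minimal sketch, assuming per-attempt boolean judge verdicts (the helper name and sample size below are illustrative, not from the paper):

```python
def attack_success_rate(verdicts: list[bool]) -> float:
    """ASR = fraction of attempts judged as successful jailbreaks."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0

# Illustrative arithmetic only: 9 successes out of 22 attempts gives
# 9/22 ~= 0.409, one ratio consistent with the reported 40.9% figure.
assert round(attack_success_rate([True] * 9 + [False] * 13), 3) == 0.409
```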