One Perturbation, Two Failure Modes: Probing VLM Safety via Embedding-Guided Typographic Perturbations

arXiv cs.CV / 4/29/2026


Key Points

  • The paper studies how typographic prompt injection can bypass safety in vision-language models (VLMs) by manipulating text rendered inside images.
  • Across four VLMs, twelve font sizes, and ten transformations, the authors find that multimodal embedding distance is a strong predictor of attack success rate (ASR), yielding an interpretable, model-agnostic proxy (see the sketch after this list).
  • They argue that the relationship between embedding distance and attack success is mediated by two factors: perceptual readability (whether the VLM can parse the rendered text) and safety alignment (whether it refuses to comply).
  • Using embedding-guided optimization over surrogate models, the authors build a red-teaming method that maximizes image-text embedding similarity under ℓ∞-bounded perturbations to stress-test both readability and safety refusals.
  • Experiments on several VLMs and degradation regimes show that the optimization simultaneously improves readability and reduces safety-aligned refusals, with the dominant failure mechanism varying by model and the level of visual distortion.
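
The proxy in the second point can be approximated with an off-the-shelf multimodal encoder. The sketch below is illustrative, not the paper's code: the CLIP checkpoint, the `variants` list (typographic renderings paired with their measured ASR on a target VLM), and all variable names are assumptions.

```python
# Minimal sketch: check whether multimodal embedding distance predicts
# attack success rate (ASR). The model choice and the `variants` data
# are illustrative assumptions, not the paper's exact setup.
import torch
from scipy.stats import pearsonr
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embedding_distance(image, text):
    """Cosine distance between CLIP image and text embeddings."""
    inputs = processor(text=[text], images=[image],
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return 1.0 - (img @ txt.T).item()

# `variants` (hypothetical) pairs each rendering (font size, transform)
# of the injected prompt with its measured ASR on a target VLM.
distances = [embedding_distance(img, prompt) for img, prompt, _ in variants]
asrs = [asr for _, _, asr in variants]
r, p = pearsonr(distances, asrs)  # the paper reports r = -0.71 to -0.93
print(f"Pearson r = {r:.2f}, p = {p:.3g}")
```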

Abstract

Typographic prompt injection exploits the ability of vision-language models (VLMs) to read text rendered in images, posing a growing threat as VLMs power autonomous agents. Prior work typically focuses on maximizing attack success rate (ASR) but does not explain *why* certain renderings bypass safety alignment. We make two contributions. First, an empirical study across four VLMs (including GPT-4o and Claude), twelve font sizes, and ten transformations reveals that multimodal embedding distance strongly predicts ASR (r = -0.71 to -0.93, p < 0.01), providing an interpretable, model-agnostic proxy. Since embedding distance predicts ASR, reducing it should improve attack success, but the relationship is mediated by two factors: perceptual readability (whether the VLM can parse the text) and safety alignment (whether it refuses to comply). Second, we use this as a red-teaming tool: we directly maximize image-text embedding similarity under bounded ℓ∞ perturbations via CWA-SSA across four surrogate embedding models, stress-testing both factors without access to the target model. Experiments across five degradation settings on GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL confirm that optimization recovers readability and reduces safety-aligned refusals as two co-occurring effects, with the dominant mechanism depending on the model's safety-filter strength and the degree of visual degradation.
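
The attack side of the abstract can be pictured as projected gradient ascent on image-text similarity over an ensemble of surrogate encoders. The sketch below is a simplified stand-in, assuming a generic encoder interface: it replaces CWA-SSA with plain ensemble PGD (so the actual method's ensemble-weighting and augmentation details are not reproduced), and `eps`, `alpha`, and `steps` are illustrative hyperparameters.

```python
# Simplified ensemble-PGD sketch of embedding-guided optimization under
# an l_inf bound. Not the paper's CWA-SSA implementation; interfaces and
# hyperparameters are assumptions.
import torch

def embedding_attack(image, text_feats, encoders,
                     eps=8 / 255, alpha=1 / 255, steps=100):
    """
    image:      rendered-text image, float tensor in [0, 1], shape (1, 3, H, W)
    text_feats: unit-norm text embeddings, one per surrogate encoder
    encoders:   callables mapping a pixel tensor to an image embedding
    """
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        adv = (image + delta).clamp(0, 1)
        # Maximize mean cosine similarity across the surrogate ensemble.
        loss = 0.0
        for enc, t in zip(encoders, text_feats):
            f = enc(adv)
            f = f / f.norm(dim=-1, keepdim=True)
            loss = loss + (f * t).sum()
        loss = loss / len(encoders)
        loss.backward()
        with torch.no_grad():
            # Signed-gradient ascent step, then project back into the
            # l_inf ball of radius eps around the original image.
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
        delta.grad.zero_()
    return (image + delta).clamp(0, 1).detach()
```

Keeping the perturbation inside an ℓ∞ ball of radius `eps` mirrors the bounded-perturbation constraint in the abstract, so the optimized image stays visually close to the original rendering while its embedding moves toward the target text.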