Reading Between the Pixels: Linking Text-Image Embedding Alignment to Typographic Attack Success on Vision-Language Models

arXiv cs.CV · April 15, 2026


Key Points

  • The paper analyzes typographic prompt-injection attacks on vision-language models by rendering adversarial text as images, targeting VLMs used in autonomous/agentic systems.
  • Experiments across four VLMs (GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, Qwen3-VL-4B) and 1,000 SALAD-Bench prompts show that font size strongly drives attack success rate (ASR): mid-range fonts perform best, while 6px text yields near-zero ASR.
  • Attack effectiveness depends on the VLM and modality: text attacks outperform image attacks for GPT-4o and Claude, while Qwen3-VL and Mistral show more similar success across modalities.
  • The study finds a strong negative correlation between ASR and text-image embedding distance computed with multimodal embedding models (JinaCLIP, Qwen3-VL-Embedding), linking success to alignment quality.
  • It also observes that heavy visual degradations increase embedding distance and substantially reduce ASR, and that rotation affects models asymmetrically, implying that defenses must account for backbone-specific robustness rather than relying on one-size-fits-all rules.
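The correlation result in the fourth point can be sketched with plain NumPy. The embedding models named in the paper (JinaCLIP, Qwen3-VL-Embedding) are assumed to have already produced the text and image vectors; `cosine_distance` and `pearson_r` below are illustrative helpers, not the authors' code.

```python
import numpy as np

def cosine_distance(text_emb: np.ndarray, image_emb: np.ndarray) -> float:
    """1 - cosine similarity between L2-normalized embedding vectors."""
    t = text_emb / np.linalg.norm(text_emb)
    i = image_emb / np.linalg.norm(image_emb)
    return 1.0 - float(t @ i)

def pearson_r(x, y) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Hypothetical per-condition measurements: as text-image embedding
# distance grows, attack success rate (ASR) falls, giving a negative r
# in the spirit of the paper's reported r = -0.71 to -0.93.
distances = [0.10, 0.20, 0.30, 0.40]
asr       = [0.90, 0.70, 0.50, 0.30]
r = pearson_r(distances, asr)
```

In the study, one such (distance, ASR) pair would exist per attack condition and model, and the correlation is computed per VLM backbone.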

Abstract

We study typographic prompt injection attacks on vision-language models (VLMs), where adversarial text is rendered as images to bypass safety mechanisms, posing a growing threat as VLMs serve as the perceptual backbone of autonomous agents, from browser automation and computer-use systems to camera-equipped embodied agents. In practice, the attack surface is heterogeneous: adversarial text appears at varying font sizes and under diverse visual conditions, while the growing ecosystem of VLMs exhibits substantial variation in vulnerability, complicating defensive approaches. Evaluating 1,000 prompts from SALAD-Bench across four VLMs, namely GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL-4B-Instruct, under varying font sizes (6--28px) and visual transformations (rotation, blur, noise, contrast changes), we find: (1) font size significantly affects attack success rate (ASR), with very small fonts (6px) yielding near-zero ASR while mid-range fonts achieve peak effectiveness; (2) text attacks are more effective than image attacks for GPT-4o (36% vs 8%) and Claude (47% vs 22%), while Qwen3-VL and Mistral show comparable ASR across modalities; (3) text-image embedding distance from two multimodal embedding models (JinaCLIP and Qwen3-VL-Embedding) shows strong negative correlation with ASR across all four models (r = -0.71 to -0.93, p < 0.01); (4) heavy degradations increase embedding distance by 10--12% and reduce ASR by 34--96%, while rotation asymmetrically affects models (Mistral drops 50%, GPT-4o unchanged). These findings highlight that model-specific robustness patterns preclude one-size-fits-all defenses and offer empirical guidance for practitioners selecting VLM backbones for agentic systems operating in adversarial environments.
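Two of the visual transformations the abstract lists, noise and contrast change, can be sketched as array operations on a grayscale image. The parameters below (noise sigma, contrast factor) are illustrative assumptions, not the paper's settings, and the paper's rotation and blur transforms are omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def add_noise(img: np.ndarray, sigma: float) -> np.ndarray:
    """Additive Gaussian noise, clipped back to the valid [0, 255] range."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0, 255)

def adjust_contrast(img: np.ndarray, factor: float) -> np.ndarray:
    """Scale each pixel's deviation from the image mean by `factor`.
    factor < 1 flattens contrast; factor > 1 exaggerates it."""
    m = img.mean()
    return np.clip(m + factor * (img - m), 0, 255)

# A stand-in for a rendered typographic-attack image: mid-gray canvas.
img = np.full((64, 64), 128.0)
degraded = adjust_contrast(add_noise(img, 25.0), 0.5)
```

In the paper's framing, each degraded variant would be embedded and fed to the VLM; heavier degradations push the text-image embedding distance up and the measured ASR down.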