Reading Between the Pixels: Linking Text-Image Embedding Alignment to Typographic Attack Success on Vision-Language Models
arXiv cs.CV / 4/15/2026
Key Points
- The paper analyzes typographic prompt-injection attacks on vision-language models by rendering adversarial text as images, targeting VLMs used in autonomous/agentic systems.
- Experiments across four VLMs (GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, Qwen3-VL-4B) on 1,000 SALAD-Bench prompts show that font size strongly drives attack success rate (ASR): mid-range font sizes perform best, while 6 px text yields a near-zero ASR.
- Attack effectiveness depends on both the VLM and the modality: text-based attacks outperform their image-rendered counterparts for GPT-4o and Claude, while Qwen3-VL and Mistral show comparable success rates across modalities.
- The study finds a strong negative correlation between ASR and text-image embedding distance computed with multimodal embedding models (JinaCLIP, Qwen3-VL-Embedding), linking success to alignment quality.
- It also observes that heavy visual degradations increase embedding distance and substantially reduce ASR, with rotation affecting models asymmetrically, implying that defenses must account for backbone-specific robustness rather than relying on one-size-fits-all rules.
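The negative distance-ASR relationship in the points above can be illustrated with a toy sketch: cosine distance between a "text" embedding and progressively degraded "image" embeddings, correlated against stand-in ASR values. Everything here is a synthetic placeholder (random vectors, invented ASR numbers), not the paper's data or its embedding models (JinaCLIP, Qwen3-VL-Embedding).

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance between two embedding vectors (0 = same direction)."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: one "text" vector plus several "image" renderings
# of it at increasing degradation strength (illustrative only).
rng = np.random.default_rng(0)
text_emb = rng.normal(size=512)
noise_levels = [0.1, 0.5, 1.0, 2.0, 4.0]
image_embs = [text_emb + s * rng.normal(size=512) for s in noise_levels]

distances = np.array([cosine_distance(text_emb, e) for e in image_embs])

# Stand-in ASR values that fall as degradation grows (invented for the sketch).
asr = np.array([0.62, 0.55, 0.40, 0.21, 0.05])

# Pearson correlation between embedding distance and ASR.
r = np.corrcoef(distances, asr)[0, 1]
print(f"distances: {np.round(distances, 3)}")
print(f"Pearson r(distance, ASR) = {r:.3f}")
```

With real data one would replace the random vectors with actual text/image embeddings from a multimodal encoder and the stand-in ASR array with measured success rates; the paper's reported finding is that this correlation comes out strongly negative.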