Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks

arXiv cs.CV / 4/15/2026


Key Points

  • The paper argues that security research on vision-language models (VLMs) has mostly stayed in the digital domain, leaving real-world physical threats largely unexplored despite growing deployments.
  • It introduces Multimodal Semantic Lighting Attacks (MSLA), a physically deployable adversarial framework that uses controllable lighting to target semantic alignment rather than just output labels.
  • Experiments show MSLA can degrade zero-shot classification performance of common CLIP variants and cause severe semantic hallucinations in VLMs such as LLaVA and BLIP across image captioning and VQA.
  • Results in both digital and physical settings indicate MSLA is effective, transferable, and practically realizable, revealing a robustness gap specific to physical-world attacks.
  • The authors conclude that VLMs are highly vulnerable to physically realizable semantic attacks and call for urgent physical-world robustness evaluation to assess real deployment risk.

Abstract

Vision-Language Models (VLMs) have shown remarkable performance, yet their security remains insufficiently understood. Existing adversarial studies focus almost exclusively on the digital setting, leaving physical-world threats largely unexplored. As VLMs are increasingly deployed in real environments, this gap becomes critical, since adversarial perturbations must be physically realizable to matter in practice. Physical attacks can induce recognition failures and further disrupt multimodal reasoning, leading to severe semantic misinterpretation in downstream tasks, yet they have not been systematically studied for VLMs. To address this gap, we propose Multimodal Semantic Lighting Attacks (MSLA), the first physically deployable adversarial attack framework against VLMs. MSLA uses controllable adversarial lighting to disrupt multimodal semantic understanding in real scenes, attacking semantic alignment rather than only task-specific outputs. As a result, it degrades the zero-shot classification performance of mainstream CLIP variants and induces severe semantic hallucinations in advanced VLMs such as LLaVA and BLIP across image captioning and visual question answering (VQA). Extensive experiments in both digital and physical domains demonstrate that MSLA is effective, transferable, and practically realizable. Our findings provide the first evidence that VLMs are highly vulnerable to physically deployable semantic attacks, exposing a previously overlooked robustness gap and underscoring the urgent need for physical-world robustness evaluation of VLMs.
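
To make the idea of "attacking semantic alignment rather than task-specific outputs" concrete, here is a minimal sketch of how a lighting-style perturbation could be optimized against a CLIP model's image-text similarity. This is not the authors' MSLA procedure (the abstract does not describe their lighting parameterization or loss); the per-channel gain, the 8x8 illumination grid, the file name `scene.jpg`, the caption, and all hyperparameters are assumptions made purely for illustration.

```python
# Hedged sketch: optimize a low-dimensional "lighting" perturbation (per-channel
# gain + a coarse additive illumination map) to reduce CLIP's similarity between
# the scene image and its correct caption, i.e., attack semantic alignment.
# This is NOT the MSLA method from the paper; it only illustrates the concept.
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model = model.float().eval()
for p in model.parameters():
    p.requires_grad_(False)

# CLIP's normalization statistics (re-applied after the lighting transform).
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
std = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

# Assumed inputs: a scene photo and the caption whose alignment we want to break.
img = preprocess(Image.open("scene.jpg")).unsqueeze(0).to(device)  # normalized tensor
pixels = (img * std + mean).clamp(0, 1)                            # back to [0, 1] range
text = clip.tokenize(["a photo of a stop sign"]).to(device)
with torch.no_grad():
    txt_emb = F.normalize(model.encode_text(text), dim=-1)

# Lighting parameters: per-channel gain and an 8x8 additive illumination grid,
# upsampled to image size (a crude stand-in for a controllable light source).
gain = torch.zeros(1, 3, 1, 1, device=device, requires_grad=True)
illum = torch.zeros(1, 3, 8, 8, device=device, requires_grad=True)
opt = torch.optim.Adam([gain, illum], lr=0.05)

for step in range(200):
    light = F.interpolate(illum, size=pixels.shape[-2:], mode="bilinear", align_corners=False)
    lit = ((1.0 + gain.tanh()) * pixels + 0.3 * light.tanh()).clamp(0, 1)
    img_emb = F.normalize(model.encode_image((lit - mean) / std), dim=-1)
    sim = (img_emb * txt_emb).sum()  # cosine similarity to the true caption
    loss = sim                       # minimize alignment with the correct semantics
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final image-text similarity:", sim.item())
```

Because the perturbation lives in a small, smooth lighting space rather than per-pixel noise, a pattern of this kind is at least plausibly projectable onto a real scene, which is the physical-deployability property the paper emphasizes; a transferability check would additionally evaluate the lit image against other CLIP variants or downstream VLMs.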