Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks
arXiv cs.CV / April 15, 2026
Key Points
- The paper argues that security research on vision-language models (VLMs) has mostly stayed in the digital domain, leaving physical-world threats largely unexplored even as real deployments grow.
- It introduces Multimodal Semantic Lighting Attacks (MSLA), a physically deployable adversarial framework that uses controllable lighting to target semantic alignment rather than just output labels (a toy sketch of the core idea follows this list).
- Experiments show MSLA can degrade the zero-shot classification accuracy of common CLIP variants and induce severe semantic hallucinations in VLMs such as LLaVA and BLIP on image captioning and VQA tasks.
- Results in both digital and physical settings indicate MSLA is effective, transferable, and practically realizable, revealing a robustness gap specific to physical-world attacks.
- The authors conclude that VLMs are highly vulnerable to physically realizable semantic attacks and call for urgent physical-world robustness evaluation to assess real deployment risk.
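
The summary does not spell out MSLA's optimization, but the basic recipe of a semantics-targeting lighting attack can be sketched. The snippet below is a minimal toy version, not the paper's method: it assumes a hypothetical two-parameter lighting model (per-channel color gains plus a horizontal brightness ramp standing in for a directional light) and steers a CLIP image embedding toward an attacker-chosen caption. The model name, parameter bounds, and lighting field are all illustrative assumptions.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Victim model; the paper evaluates several CLIP variants, this is one common choice.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
for p in model.parameters():  # only the lighting parameters are optimized
    p.requires_grad_(False)

# CLIP's preprocessing statistics, applied differentiably below.
MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(3, 1, 1)
STD = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(3, 1, 1)

def apply_lighting(image, gains, ramp):
    """Hypothetical lighting model: per-channel gain plus a left-to-right
    brightness ramp, mimicking a color-controllable directional light."""
    _, _, w = image.shape
    xs = torch.linspace(-1.0, 1.0, w)                # horizontal coordinate
    field = 1.0 + ramp * xs                          # (W,) multiplicative ramp
    return (image * gains.view(3, 1, 1) * field).clamp(0.0, 1.0)

def lighting_attack(image, target_caption, steps=200, lr=5e-2):
    """image: (3, 224, 224) float tensor in [0, 1], already resized for CLIP.
    Optimizes the lighting so the image embedding drifts toward the target
    caption's text embedding, i.e. it attacks semantic alignment, not a label."""
    gains = torch.ones(3, requires_grad=True)
    ramp = torch.zeros(1, requires_grad=True)
    tokens = processor(text=[target_caption], return_tensors="pt", padding=True)
    with torch.no_grad():
        t = model.get_text_features(**tokens)
        t = t / t.norm(dim=-1, keepdim=True)
    opt = torch.optim.Adam([gains, ramp], lr=lr)
    for _ in range(steps):
        lit = apply_lighting(image, gains, ramp)
        v = model.get_image_features(pixel_values=((lit - MEAN) / STD).unsqueeze(0))
        v = v / v.norm(dim=-1, keepdim=True)
        loss = -(v * t).sum()                        # maximize cosine similarity
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():                        # keep the lighting plausible,
            gains.clamp_(0.5, 1.5)                   # i.e. physically deployable
            ramp.clamp_(-0.5, 0.5)
    return apply_lighting(image, gains, ramp).detach()
```

The deliberately low-dimensional parameterization is what makes an attack like this physically plausible: a handful of gain and direction values map onto settings of a real controllable light, unlike per-pixel perturbations that only exist in the digital domain.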
