Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models

arXiv cs.CV / 4/6/2026

Key Points

  • The paper shows that infrared vision-language models (IR-VLMs), despite their promise for low-visibility perception, remain vulnerable to physical-world adversarial attacks, a threat not well covered by prior RGB-focused methods.
  • It introduces the Universal Curved-Grid Patch (UCGP), a deployable universal adversarial patch framework that combines Curved-Grid Mesh (CGM) parameterization with a representation-level objective built from subspace departure, topology disruption, and stealth terms, rather than manipulating prompts or labels (an illustrative loss sketch follows this list).
  • To improve real-world robustness under domain shift, the method further combines Meta Differential Evolution with expectation-over-transformation (EOT)-augmented thin-plate-spline (TPS) deformation modeling to better simulate physical transformations.
  • Experiments show that UCGP reliably degrades semantic understanding across multiple IR-VLM architectures, with strong cross-model and cross-dataset transferability and demonstrated effectiveness in real physical settings.
  • Overall, the work highlights a previously underappreciated robustness weakness in infrared multimodal systems and suggests existing defenses may not cover representation-space disruption threats.
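To make the representation-level objective more concrete, below is a minimal PyTorch-style sketch, not the authors' implementation: the three terms follow the paper's terminology (subspace departure, topology disruption, stealth), but their exact formulations here are assumptions, and all function and variable names are hypothetical.

```python
# Hedged sketch of a representation-level patch objective, assuming a CLIP-style
# vision encoder that exposes pooled features and patch-token features.
import torch
import torch.nn.functional as F

def subspace_departure(adv_feats, clean_basis):
    """Push adversarial pooled features out of the clean-feature subspace.

    adv_feats:   (B, D) pooled features of patched infrared images.
    clean_basis: (D, k) orthonormal basis of the clean-feature subspace
                 (e.g. top-k PCA directions of clean features).
    """
    proj = adv_feats @ clean_basis @ clean_basis.T          # component inside the subspace
    return -(adv_feats - proj).norm(dim=-1).mean()          # minimizing maximizes departure

def topology_disruption(adv_tokens, clean_tokens):
    """Disturb the pairwise-similarity structure of the visual tokens (B, N, D)."""
    a = F.normalize(adv_tokens, dim=-1)
    c = F.normalize(clean_tokens, dim=-1)
    sim_adv = a @ a.transpose(-1, -2)
    sim_cln = c @ c.transpose(-1, -2)
    return -(sim_adv - sim_cln).abs().mean()                # minimizing maximizes disruption

def stealth(patch):
    """Total-variation penalty that keeps the patch smooth and low-frequency."""
    tv_h = (patch[..., 1:, :] - patch[..., :-1, :]).abs().mean()
    tv_w = (patch[..., :, 1:] - patch[..., :, :-1]).abs().mean()
    return tv_h + tv_w

def representation_loss(adv_feats, adv_tokens, clean_tokens, clean_basis, patch,
                        w_sub=1.0, w_topo=1.0, w_stealth=0.1):
    """Scalar loss to minimize when optimizing the patch."""
    return (w_sub * subspace_departure(adv_feats, clean_basis)
            + w_topo * topology_disruption(adv_tokens, clean_tokens)
            + w_stealth * stealth(patch))
```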

Abstract

Infrared vision-language models (IR-VLMs) have emerged as a promising paradigm for multimodal perception in low-visibility environments, yet their robustness to adversarial attacks remains largely unexplored. Existing adversarial patch methods are mainly designed for RGB-based models in closed-set settings and are not readily applicable to the open-ended semantic understanding and physical deployment requirements of infrared VLMs. To bridge this gap, we propose Universal Curved-Grid Patch (UCGP), a universal physical adversarial patch framework for IR-VLMs. UCGP integrates Curved-Grid Mesh (CGM) parameterization for continuous, low-frequency, and deployable patch generation with a unified representation-driven objective that promotes subspace departure, topology disruption, and stealth. To improve robustness under real-world deployment and domain shift, we further incorporate Meta Differential Evolution and EOT-augmented TPS deformation modeling. Rather than manipulating labels or prompts, UCGP directly disrupts the visual representation space, weakening cross-modal semantic alignment. Extensive experiments demonstrate that UCGP consistently compromises semantic understanding across diverse IR-VLM architectures while maintaining cross-model transferability, cross-dataset generalization, real-world physical effectiveness, and robustness against defenses. These findings reveal a previously overlooked robustness vulnerability in current infrared multimodal systems.
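As an illustration of the EOT-augmented deformation modeling mentioned in the abstract, the sketch below averages the attack loss over randomly sampled physical transformations while the patch is optimized. A smooth random warp stands in for the paper's TPS deformation, `apply_patch` and `loss_fn` are hypothetical helpers (the latter could be the representation-level loss sketched above), and the Meta Differential Evolution component is not shown.

```python
# Hedged sketch of an expectation-over-transformation (EOT) attack loss with a
# smooth random warp as a simple stand-in for thin-plate-spline deformation.
import torch
import torch.nn.functional as F

def random_smooth_warp(img, strength=0.05, grid_size=4):
    """Warp img (B, C, H, W) with a low-frequency random displacement field."""
    B, _, H, W = img.shape
    # Coarse random offsets, upsampled into a smooth per-pixel flow field.
    flow = torch.randn(B, 2, grid_size, grid_size, device=img.device) * strength
    flow = F.interpolate(flow, size=(H, W), mode="bilinear", align_corners=False)
    # Base sampling grid in [-1, 1], shifted by the displacement field.
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H, device=img.device),
                            torch.linspace(-1, 1, W, device=img.device), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(B, H, W, 2)
    grid = base + flow.permute(0, 2, 3, 1)
    return F.grid_sample(img, grid, mode="bilinear", align_corners=False)

def eot_loss(patch, ir_images, apply_patch, loss_fn, n_samples=8):
    """Average the attack loss over random deformations and brightness jitter,
    so the patch must remain effective across simulated physical conditions.

    apply_patch(images, patch) -> patched images   (hypothetical placement helper)
    loss_fn(patched_images)    -> scalar attack loss to minimize
    """
    total = 0.0
    for _ in range(n_samples):
        patched = apply_patch(ir_images, patch)
        patched = random_smooth_warp(patched)                       # TPS-like deformation
        brightness = 0.8 + 0.4 * torch.rand(1, device=patch.device)  # thermal intensity jitter
        patched = (patched * brightness).clamp(0.0, 1.0)
        total = total + loss_fn(patched)
    return total / n_samples
```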