HQA-VLAttack: Towards High Quality Adversarial Attack on Vision-Language Pre-Trained Models
arXiv cs.CV / 4/21/2026
Key Points
- The paper introduces HQA-VLAttack, a new black-box adversarial attack framework targeting vision-language pre-trained models where both text and image perturbations must be handled jointly.
- It improves text perturbation generation by using counter-fitting word vectors to build substitute word sets that maintain semantic consistency with the original text.
- For images, it initializes adversarial examples with a layer-importance guided strategy and then refines perturbations via contrastive learning to simultaneously reduce positive pair similarity and increase negative pair similarity.
- Experiments on three benchmark datasets show that HQA-VLAttack achieves substantially higher attack success rates than strong existing baselines, addressing the limitations of prior approaches that either require many queries or perturb only a single modality.
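The contrastive refinement in the third key point can be illustrated with a minimal sketch. The paper's exact objective is not given here, so the function names, the cosine-similarity formulation, and the NumPy implementation below are all illustrative assumptions: the attacker's loss rewards pushing the adversarial image embedding away from its matched (positive) caption and toward mismatched (negative) captions.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def contrastive_attack_loss(adv_img_emb, pos_text_emb, neg_text_embs):
    """Illustrative attacker objective (an assumption, not the paper's
    exact loss): minimizing this value decreases similarity to the
    matched caption while increasing mean similarity to negatives."""
    pos_sim = cosine(adv_img_emb, pos_text_emb)
    neg_sim = np.mean([cosine(adv_img_emb, n) for n in neg_text_embs])
    return pos_sim - neg_sim

# Toy 2-D embeddings: an image aligned with its caption (positive pair)
# and one mismatched caption (negative).
img_emb = np.array([1.0, 0.0])
pos_emb = np.array([1.0, 0.0])
negs = [np.array([0.0, 1.0])]

# A perturbation that rotates the image embedding toward the negative
# caption lowers the loss, i.e. makes the attack "better".
adv_emb = np.array([0.2, 1.0])
clean_loss = contrastive_attack_loss(img_emb, pos_emb, negs)
adv_loss = contrastive_attack_loss(adv_emb, pos_emb, negs)
```

In a real attack, `adv_img_emb` would come from the victim encoder on the perturbed image, and the loss would drive black-box perturbation updates; this sketch only shows the direction of the objective.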