HQA-VLAttack: Towards High Quality Adversarial Attack on Vision-Language Pre-Trained Models
arXiv cs.CV / 4/21/2026
Key Points
- The paper introduces HQA-VLAttack, a new black-box adversarial attack framework targeting vision-language pre-trained models where both text and image perturbations must be handled jointly.
- It improves text perturbation generation by using counter-fitting word vectors to build substitute word sets that maintain semantic consistency with the original text.
- For images, it initializes adversarial examples with a layer-importance guided strategy and then refines perturbations via contrastive learning to simultaneously reduce positive pair similarity and increase negative pair similarity.
- Experiments on three benchmark datasets show that HQA-VLAttack achieves substantially higher attack success rates than strong existing baselines, addressing the limitations of prior approaches that either require many queries or perturb only a single modality.
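The contrastive refinement in the third key point can be illustrated with a minimal sketch. The paper's exact objective is not given here, so the function names, the cosine-similarity formulation, and the NumPy implementation below are all illustrative assumptions: the attacker's loss rewards pushing the adversarial image embedding away from its matched (positive) caption and toward mismatched (negative) captions.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def contrastive_attack_loss(adv_img_emb, pos_text_emb, neg_text_embs):
    """Illustrative attacker objective (an assumption, not the paper's
    exact loss): minimizing this value decreases similarity to the
    matched caption while increasing mean similarity to negatives."""
    pos_sim = cosine(adv_img_emb, pos_text_emb)
    neg_sim = np.mean([cosine(adv_img_emb, n) for n in neg_text_embs])
    return pos_sim - neg_sim

# Toy 2-D embeddings: an image aligned with its caption (positive pair)
# and one mismatched caption (negative).
img_emb = np.array([1.0, 0.0])
pos_emb = np.array([1.0, 0.0])
negs = [np.array([0.0, 1.0])]

# A perturbation that rotates the image embedding toward the negative
# caption lowers the loss, i.e. makes the attack "better".
adv_emb = np.array([0.2, 1.0])
clean_loss = contrastive_attack_loss(img_emb, pos_emb, negs)
adv_loss = contrastive_attack_loss(adv_emb, pos_emb, negs)
```

In a real attack, `adv_img_emb` would come from the victim encoder on the perturbed image, and the loss would drive black-box perturbation updates; this sketch only shows the direction of the objective.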