AGFT: Alignment-Guided Fine-Tuning for Zero-Shot Adversarial Robustness of Vision-Language Models

arXiv cs.CV / 4/1/2026


Key Points

  • The paper addresses a problem where pre-trained vision-language models (VLMs) are vulnerable to adversarial perturbations despite strong zero-shot performance.
  • It argues that existing label-based adversarial fine-tuning can break the models’ cross-modal alignment, harming image-text correspondence and reducing zero-shot accuracy.
  • It introduces Alignment-Guided Fine-Tuning (AGFT), which uses the original model’s probabilistic (soft) predictions to guide adversarial training while preserving the relative structure between visual features and textual embeddings.
  • To mitigate fine-tuning-induced structural shifts, AGFT adds a distribution consistency calibration step that aligns the robust model’s outputs with a temperature-scaled version of the pre-trained model.
  • Experiments across multiple zero-shot benchmarks show AGFT outperforms prior state-of-the-art approaches, yielding stronger zero-shot adversarial robustness without sacrificing cross-modal semantics.
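The soft-alignment idea in the third bullet can be sketched concretely. In a CLIP-style VLM, an image's "prediction" over class prompts is a softmax over cosine similarities between its visual feature and the text embeddings; AGFT-style training would pull the adversarial image's alignment distribution toward the frozen original model's soft predictions rather than toward hard labels. The function names, the temperature `tau=0.07`, and the KL(teacher‖student) direction below are illustrative assumptions, not details from the paper:

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def alignment_distribution(image_feats, text_embeds, tau=0.07):
    # CLIP-style soft prediction: softmax over cosine similarities
    # between each image feature and all class-prompt text embeddings.
    img = l2_normalize(image_feats)
    txt = l2_normalize(text_embeds)
    return softmax(img @ txt.T / tau, axis=-1)

def soft_alignment_loss(adv_feats, clean_feats_frozen, text_embeds, tau=0.07):
    # KL(teacher || student): align the adversarial features' distribution
    # with the frozen pre-trained model's soft predictions on clean input,
    # preserving relative image-text relationships instead of hard labels.
    p_teacher = alignment_distribution(clean_feats_frozen, text_embeds, tau)
    p_student = alignment_distribution(adv_feats, text_embeds, tau)
    kl = np.sum(p_teacher * (np.log(p_teacher + 1e-12)
                             - np.log(p_student + 1e-12)), axis=-1)
    return float(np.mean(kl))
```

When the adversarial features coincide with the clean features, the loss is zero; any divergence in the alignment distribution is penalized.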

Abstract

Pre-trained vision-language models (VLMs) exhibit strong zero-shot generalization but remain vulnerable to adversarial perturbations. Existing classification-guided adversarial fine-tuning methods often disrupt pre-trained cross-modal alignment, weakening visual-textual correspondence and degrading zero-shot performance. In this paper, we propose an Alignment-Guided Fine-Tuning (AGFT) framework that enhances zero-shot adversarial robustness while preserving the cross-modal semantic structure. Unlike label-based methods, which rely on hard labels and fail to maintain the relative relationships between image and text, AGFT leverages the probabilistic predictions of the original model for text-guided adversarial training, aligning adversarial visual features with textual embeddings via soft alignment distributions. To address structural discrepancies introduced by fine-tuning, we introduce a distribution consistency calibration mechanism that adjusts the robust model's outputs to match a temperature-scaled version of the pre-trained model's predictions. Extensive experiments across multiple zero-shot benchmarks demonstrate that AGFT outperforms state-of-the-art methods, achieving significantly stronger zero-shot adversarial robustness without sacrificing cross-modal semantics.
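The distribution consistency calibration step can be sketched as a distillation-style objective: the robust (fine-tuned) model's output distribution is matched to a softened, temperature-scaled version of the frozen pre-trained model's predictions. The specific temperature value and the KL direction here are assumptions for illustration; the paper's exact formulation may differ:

```python
import numpy as np

def tempered_softmax(logits, t=1.0, axis=-1):
    # Softmax with temperature t; larger t yields a softer distribution.
    z = logits / t
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def consistency_calibration_loss(robust_logits, pretrained_logits, temperature=2.0):
    # Align the robust model's outputs with a temperature-scaled
    # (softened) version of the frozen pre-trained model's predictions,
    # mitigating structural shifts introduced by fine-tuning.
    target = tempered_softmax(pretrained_logits, t=temperature)  # softened teacher
    pred = tempered_softmax(robust_logits, t=1.0)                # robust model output
    kl = np.sum(target * (np.log(target + 1e-12) - np.log(pred + 1e-12)), axis=-1)
    return float(np.mean(kl))
```

Raising the temperature increases the entropy of the teacher distribution, so the calibration constrains the robust model toward the pre-trained model's relative ranking of classes rather than its sharpest peak.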