Hierarchically Robust Zero-shot Vision-language Models
arXiv cs.AI / 4/22/2026
Key Points
- The paper addresses a key weakness of vision-language models (VLMs) in zero-shot classification: their predictions are vulnerable to adversarial attacks.
- It argues that prior robust fine-tuning methods that align fixed text embeddings with image embeddings can hurt both natural performance and robustness.
- The authors propose a hierarchical adversarial fine-tuning framework that uses hierarchical embeddings and performs multi-level robust alignment between image and text modalities.
- They introduce additional mechanisms to place visual embeddings at the appropriate depth in the class hierarchy and provide a theoretical link between hierarchy depth and the maximum feasible margin size.
- Experiments on multiple datasets show that the method improves adversarial robustness; aligning across multiple hierarchy trees to increase semantic variety yields further gains.
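The multi-level robust alignment idea can be sketched as follows. This is an illustrative assumption of how a hierarchical alignment loss might look, not the paper's exact formulation: the function names, weights, and embedding sizes are invented for the example. An (adversarially perturbed) image embedding is aligned with text embeddings at every level of a class hierarchy (e.g. "animal" → "dog" → "beagle"), and the per-level contrastive losses are combined.

```python
import numpy as np

def normalize(x, axis=-1):
    # L2-normalize embeddings so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def level_loss(img_emb, text_embs, target_idx, temperature=0.07):
    # Cross-entropy over cosine similarities at one hierarchy level
    # (CLIP-style contrastive objective; temperature is illustrative).
    logits = normalize(img_emb) @ normalize(text_embs).T / temperature
    logits = logits - logits.max()  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target_idx]

def hierarchical_loss(img_emb, levels, level_weights=None):
    # Weighted sum of alignment losses across hierarchy levels.
    # levels: list of (text_embeddings, target_index), coarse to fine.
    if level_weights is None:
        level_weights = [1.0] * len(levels)
    return sum(w * level_loss(img_emb, t, i)
               for w, (t, i) in zip(level_weights, levels))

rng = np.random.default_rng(0)
img = rng.normal(size=512)                  # stand-in image embedding
coarse = (rng.normal(size=(3, 512)), 0)     # e.g. 3 superclasses, target 0
fine = (rng.normal(size=(10, 512)), 4)      # e.g. 10 leaf classes, target 4
loss = hierarchical_loss(img, [coarse, fine])
```

In a real adversarial fine-tuning loop, `img` would come from the image encoder applied to a PGD-perturbed input, and the combined loss would be minimized over the encoder's parameters.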
