What Makes VLMs Robust? Towards Reconciling Robustness and Accuracy in Vision-Language Models
arXiv cs.CV / 3/16/2026
Key Points
- The study shows that adversarial robustness in Vision-Language Models is concentrated in shallow layers, driven by a low-frequency spectral bias and input-insensitive attention, which challenges the assumption that deeper layers drive robustness.
- Updating deep layers tends to undermine both clean accuracy and robust generalization, indicating that robustness varies non-uniformly with network depth.
- The authors propose Adversarial Robustness Adaptation (R-Adapt), which freezes pre-trained weights and adapts only the initial layers to balance robustness with clean accuracy.
- R-Adapt supports training-free, model-guided, and data-driven deployment modes, and generalizes to large VLMs such as LLaVA and Qwen-VL, maintaining strong robustness under attack.
- The approach is validated on 18 datasets, achieving state-of-the-art performance under a range of adversarial attacks; a project page is provided.
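The core mechanic described above, freezing pre-trained weights and adapting only the shallow layers, can be sketched in a few lines. This is a hypothetical illustration of the idea, not the authors' actual implementation: the encoder class, the `blocks` attribute name, and the `n_shallow` cutoff are all assumptions for demonstration.

```python
import torch.nn as nn

def apply_r_adapt(vision_encoder: nn.Module, blocks_attr: str = "blocks",
                  n_shallow: int = 3) -> list[str]:
    """Freeze all pre-trained weights, then unfreeze only the first
    n_shallow (shallow) blocks, mirroring the paper's finding that
    robustness concentrates in early layers. Illustrative sketch only."""
    # Freeze every parameter in the encoder.
    for p in vision_encoder.parameters():
        p.requires_grad = False
    # Re-enable gradients for the shallow blocks only.
    blocks = getattr(vision_encoder, blocks_attr)
    trainable = []
    for i, block in enumerate(blocks):
        if i < n_shallow:
            for name, p in block.named_parameters():
                p.requires_grad = True
                trainable.append(f"{blocks_attr}.{i}.{name}")
    return trainable

# Toy stand-in for a ViT-style encoder (12 blocks of one Linear each).
class ToyEncoder(nn.Module):
    def __init__(self, depth: int = 12, dim: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))

enc = ToyEncoder()
names = apply_r_adapt(enc, n_shallow=3)
frozen = sum(1 for p in enc.parameters() if not p.requires_grad)
print(len(names), frozen)  # 3 blocks x (weight, bias) trainable; the rest frozen
```

An optimizer would then be built only over `[p for p in enc.parameters() if p.requires_grad]`, so deep layers stay untouched, consistent with the observation that updating them hurts both clean accuracy and robust generalization.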