PDA: Text-Augmented Defense Framework for Robust Vision-Language Models against Adversarial Image Attacks

arXiv cs.CV / 4/2/2026


Key Points

  • The paper introduces Paraphrase-Decomposition-Aggregation (PDA), a training-free inference-time defense intended to make vision-language models (VLMs) more robust to adversarial image perturbations.
  • PDA improves robustness by applying test-time prompt paraphrasing, decomposing questions, and aggregating consistency across the augmented text inputs, without changing the underlying VLM.
  • To manage the compute/latency trade-off, the authors instantiate PDA as “invariants” that reduce inference cost while preserving most of the robustness improvements.
  • Experiments across multiple VLM architectures and benchmarks for visual question answering, classification, and captioning report consistent robustness gains against diverse adversarial attacks while maintaining competitive accuracy on clean (non-adversarial) inputs.

Abstract

Vision-language models (VLMs) are vulnerable to adversarial image perturbations. Existing defenses based on adversarial training against task-specific adversarial examples are computationally expensive and often fail to generalize to unseen attack types. To address these limitations, we introduce Paraphrase-Decomposition-Aggregation (PDA), a training-free defense framework that leverages text augmentation to enhance VLM robustness under diverse adversarial image attacks. PDA performs prompt paraphrasing, question decomposition, and consistency aggregation entirely at test time, and thus requires no modification to the underlying models. To balance robustness and efficiency, we instantiate PDA as invariants that reduce the inference cost while retaining most of its robustness gains. Experiments on multiple VLM architectures and benchmarks for visual question answering, classification, and captioning show that PDA achieves consistent robustness gains against various adversarial perturbations while maintaining competitive clean accuracy, establishing a generic, strong, and practical defense framework for VLMs during inference.
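The three-stage pipeline described in the abstract can be sketched at a high level. The following is a minimal illustration, not the paper's implementation: `paraphrase`, `decompose`, and the toy VLM stub are all hypothetical stand-ins (the actual method would use an LLM for paraphrasing and decomposition), and majority voting is used here as one simple form of consistency aggregation.

```python
from collections import Counter

def paraphrase(prompt, n=3):
    # Hypothetical paraphraser: the real method would rewrite the
    # prompt with a language model; here we use fixed templates.
    templates = ["{}", "Please answer: {}", "Regarding the image, {}"]
    return [t.format(prompt) for t in templates[:n]]

def decompose(prompt):
    # Hypothetical decomposition: split a compound question into
    # sub-questions; the real method would do this semantically.
    return [q.strip() + "?" for q in prompt.rstrip("?").split(" and ")]

def pda_answer(vlm, image, prompt):
    """Query the VLM with paraphrased and decomposed prompts, then
    aggregate the answers by majority vote (consistency aggregation)."""
    queries = paraphrase(prompt) + decompose(prompt)
    answers = [vlm(image, q) for q in queries]
    return Counter(answers).most_common(1)[0][0]

def toy_vlm(image, question):
    # Toy stub simulating a VLM whose answer an adversarial image
    # flips for one particular phrasing of the question.
    return "dog" if question.startswith("Please") else "cat"

print(pda_answer(toy_vlm, None, "What animal is shown and what color is it?"))
```

The intuition is that an adversarial perturbation crafted against one specific image-prompt pair is less likely to fool the model consistently across many rephrasings, so aggregating over augmented text inputs recovers the correct answer without retraining the model.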
