Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation

arXiv cs.CV / 4/9/2026


Key Points

  • The paper shows that multimodal large language models used for content moderation can be evaded via “adversarial smuggling,” which hides harmful content in human-readable visuals that the model fails to read or understand.
  • It distinguishes two attack mechanisms—Perceptual Blindness (disrupting text recognition) and Reasoning Blockade (blocking semantic understanding even when text is recognized).
  • The authors introduce SmuggleBench, a benchmark with 1,700 adversarial smuggling instances, and report attack success rates above 90% against both proprietary models (e.g., GPT-5) and open-source models (e.g., Qwen3-VL).
  • Vulnerability analysis points to root causes including limited vision encoder capability, OCR robustness gaps, and a lack of domain-specific adversarial examples.
  • Initial mitigation experiments explore test-time scaling via Chain-of-Thought prompting and adversarial training via supervised fine-tuning (SFT); the authors release their code publicly to support further research and defense development.
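The Attack Success Rate (ASR) figure above can be read as the fraction of smuggled instances that slip past the moderator unflagged. A minimal sketch of that metric, with illustrative function names and data (not taken from the paper's released code):

```python
# Hypothetical sketch of the Attack Success Rate (ASR) metric: an attack
# "succeeds" when the moderator fails to flag a harmful smuggled instance.
# Function name and data are illustrative, not from SmuggleBench itself.

def attack_success_rate(verdicts):
    """verdicts: list of booleans, True if the moderator flagged the
    harmful content (attack failed), False if it slipped through."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    evasions = sum(1 for flagged in verdicts if not flagged)
    return evasions / len(verdicts)

# e.g. 92 of 100 smuggled instances evade moderation -> ASR = 0.92
verdicts = [False] * 92 + [True] * 8
print(f"ASR = {attack_success_rate(verdicts):.2f}")  # → ASR = 0.92
```

An ASR above 0.90, as reported for both GPT-5 and Qwen3-VL, means more than nine in ten harmful instances pass moderation undetected.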

Abstract

Multimodal Large Language Models (MLLMs) are increasingly being deployed as automated content moderators. Within this landscape, we uncover a critical threat: Adversarial Smuggling Attacks. Unlike adversarial perturbations (for misclassification) and adversarial jailbreaks (for harmful output generation), adversarial smuggling exploits the Human-AI capability gap. It encodes harmful content into human-readable visual formats that remain AI-unreadable, thereby evading automated detection and enabling the dissemination of harmful content. We classify smuggling attacks into two pathways: (1) Perceptual Blindness, disrupting text recognition; and (2) Reasoning Blockade, inhibiting semantic understanding despite successful text recognition. To evaluate this threat, we constructed SmuggleBench, the first comprehensive benchmark comprising 1,700 adversarial smuggling attack instances. Evaluations on SmuggleBench reveal that both proprietary (e.g., GPT-5) and open-source (e.g., Qwen3-VL) state-of-the-art models are vulnerable to this threat, producing Attack Success Rates (ASR) exceeding 90%. By analyzing the vulnerability through the lenses of perception and reasoning, we identify three root causes: the limited capabilities of vision encoders, the robustness gap in OCR, and the scarcity of domain-specific adversarial examples. We conduct a preliminary exploration of mitigation strategies, investigating the potential of test-time scaling (via CoT) and adversarial training (via SFT) to mitigate this threat. Our code is publicly available at https://github.com/zhihengli-casia/smugglebench.
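The core idea behind Perceptual Blindness, content that a human reads effortlessly but a machine pipeline misreads, can be illustrated in miniature with Unicode homoglyph substitution. To be clear, this is a hypothetical, text-level analogy, not the paper's attack, which operates on rendered images; it only demonstrates the human-AI capability gap the abstract describes:

```python
# Illustrative (hypothetical) analogy for Perceptual Blindness: swap Latin
# letters for visually near-identical Cyrillic ones. A human reads the text
# unchanged; a byte-level matcher or brittle recognizer does not.
# This is NOT the paper's image-based smuggling pipeline.

HOMOGLYPHS = {
    "a": "\u0430",  # Cyrillic а
    "e": "\u0435",  # Cyrillic е
    "o": "\u043e",  # Cyrillic о
    "p": "\u0440",  # Cyrillic р
    "c": "\u0441",  # Cyrillic с
}

def smuggle(text: str) -> str:
    """Replace Latin letters with look-alike Cyrillic characters."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

original = "open content"
disguised = smuggle(original)
print(disguised)              # renders near-identically to a human reader
print(disguised == original)  # → False: exact-match filtering fails
```

The paper's visual attacks exploit the same asymmetry one level up: the vision encoder or OCR stage, rather than a string comparison, is what fails to recover the human-readable content.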