A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning

arXiv cs.CV / 4/7/2026


Key Points

  • The paper studies cross-modal “typographic attacks” that perturb audio, visual, and text inputs to compromise audio-visual multi-modal large language models (MLLMs) used in safety-critical settings.
  • It introduces “Multi-Modal Typography” as a systematic framework, extending beyond prior unimodal attack research to evaluate cross-modal fragility.
  • The authors find coordinated multi-modal attacks are substantially more effective than single-modality attacks, reporting an attack success rate of 83.43% versus 34.93%.
  • Experiments across multiple frontier MLLMs, tasks, and benchmarks (including common-sense reasoning and content moderation) suggest this strategy is underexplored yet critical for robustness evaluation.
  • The study will make code and data publicly available to support further research into defense and security testing of MLLMs.

Abstract

As audio-visual multi-modal large language models (MLLMs) are increasingly deployed in safety-critical applications, understanding their vulnerabilities is crucial. To this end, we introduce Multi-Modal Typography, a systematic study examining how typographic attacks across multiple modalities adversely influence MLLMs. While prior work focuses narrowly on unimodal attacks, we expose the cross-modal fragility of MLLMs. We analyze the interactions between audio, visual, and text perturbations and reveal that coordinated multi-modal attacks create a significantly more potent threat than single-modality attacks (attack success rate = 83.43% vs. 34.93%). Our findings across multiple frontier MLLMs, tasks, and common-sense reasoning and content moderation benchmarks establish multi-modal typography as a critical and underexplored attack strategy in multi-modal reasoning. Code and data will be publicly available.