A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning

arXiv cs.CV / 4/7/2026


Key Points

  • The paper studies cross-modal “typographic attacks” that perturb audio, visual, and text inputs to compromise audio-visual multi-modal large language models (MLLMs) used in safety-critical settings.
  • It introduces “Multi-Modal Typography” as a systematic framework, extending beyond prior unimodal attack research to evaluate cross-modal fragility.
  • The authors find coordinated multi-modal attacks are substantially more effective than single-modality attacks, reporting an attack success rate of 83.43% versus 34.93%.
  • Experiments across multiple frontier MLLMs, tasks, and benchmarks (including common-sense reasoning and content moderation) suggest this strategy is underexplored yet critical for robustness evaluation.
  • The study will make code and data publicly available to support further research into defense and security testing of MLLMs.

Abstract

As audio-visual multi-modal large language models (MLLMs) are increasingly deployed in safety-critical applications, understanding their vulnerabilities is crucial. To this end, we introduce Multi-Modal Typography, a systematic study examining how typographic attacks across multiple modalities adversely influence MLLMs. While prior work focuses narrowly on unimodal attacks, we expose the cross-modal fragility of MLLMs. We analyze the interactions between audio, visual, and text perturbations and reveal that coordinated multi-modal attacks create a significantly more potent threat than single-modality attacks (attack success rate = 83.43% vs. 34.93%). Our findings across multiple frontier MLLMs, tasks, and common-sense reasoning and content moderation benchmarks establish multi-modal typography as a critical and underexplored attack strategy in multi-modal reasoning. Code and data will be publicly available.