Robust Multimodal Safety via Conditional Decoding
arXiv cs.AI / 4/2/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper argues that multimodal LLMs (MLLMs) can lose safety alignment when harmful queries exploit cross-modal interactions, with text-only alignment becoming less effective when additional modalities are added.
- It introduces CASA (Classification Augmented with Safety Attention), a conditional decoding approach that predicts a binary safety token using internal model representations before generating a response.
- CASA adds a safety attention module to improve detection of malicious queries while avoiding external classifiers, auxiliary heads, and modality-specific safety fine-tuning (a minimal sketch of the idea follows this list).
- Experiments on benchmarks including MM-SafetyBench, JailbreakV-28k, and adversarial audio tests show CASA reduces average attack success rates by over 97% across modalities and attack types.
- The method preserves strong performance on benign inputs, with both automated evaluation and human assessment by 13 trained annotators supporting its utility–safety tradeoff.
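Below is a minimal PyTorch sketch of the conditional-decoding idea as summarized above: attention-pool the prompt's internal representations, score two reserved safety tokens through the model's existing LM head, and only generate a response when the safe token wins. All names (`SafetyAttention`, `conditional_decode`), the pooling design, and the token IDs are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of conditional decoding gated by a safety-attention module.
# Class/function names, the "[SAFE]"/"[UNSAFE]" token IDs, and the pooling design
# are assumptions for illustration, not code from the CASA paper.
import torch
import torch.nn as nn


class SafetyAttention(nn.Module):
    """Attention-pool the prompt's hidden states into one vector that is scored
    through the existing LM head, so no separate classifier head is needed."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        # Learned pooling query over the fused multimodal prompt representation.
        self.query = nn.Parameter(torch.randn(hidden_dim) / hidden_dim**0.5)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) from the text+image prompt.
        scores = hidden_states @ self.query                     # (batch, seq_len)
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)   # attention weights
        return (weights * hidden_states).sum(dim=1)             # (batch, hidden_dim)


def conditional_decode(lm_head, safety_attn, hidden_states, safe_id, unsafe_id,
                       generate_fn, refusal="I can't help with that request."):
    """Decode the binary safety token first; generate a reply only if it is [SAFE]."""
    pooled = safety_attn(hidden_states)                         # (batch, hidden_dim)
    logits = lm_head(pooled)                                    # (batch, vocab_size)
    # Constrain the first decoding step to the two reserved safety tokens.
    safety_logits = logits[:, [safe_id, unsafe_id]]
    is_safe = safety_logits.argmax(dim=-1) == 0
    return generate_fn() if bool(is_safe.all()) else refusal


# Toy usage with random tensors standing in for a real MLLM's states and LM head.
hidden_dim, vocab = 64, 1000
lm_head = nn.Linear(hidden_dim, vocab, bias=False)
attn = SafetyAttention(hidden_dim)
prompt_states = torch.randn(1, 32, hidden_dim)                  # fake prompt encoding
print(conditional_decode(lm_head, attn, prompt_states, safe_id=10, unsafe_id=11,
                         generate_fn=lambda: "Sure, here's how..."))
```

Reusing the LM head to score the two reserved tokens keeps the sketch consistent with the key point that CASA avoids external classifiers and auxiliary heads; the mechanism in the paper itself may differ in detail.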