Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization
arXiv cs.CV / 4/13/2026
Key Points
- The paper analyzes multimodal jailbreak vulnerabilities in vision-language models and finds that attack effectiveness varies significantly between homogeneous settings (surrogate and target are both open-source) and heterogeneous settings (surrogate and target mismatch), a phenomenon it terms "surrogate dependency."
- It proposes “Mosaic,” a multi-view ensemble optimization framework designed to reduce over-reliance on any single surrogate model and any single image view when attacking closed-source VLMs.
- Mosaic uses three modules: a text-side transformation that perturbs refusal-sensitive lexical patterns, a multi-view image optimization that updates perturbations across cropped views, and an ensemble guidance mechanism that aggregates optimization signals from multiple surrogate VLMs.
- Experiments on safety benchmarks report state-of-the-art results against commercial closed-source VLMs, including a higher Attack Success Rate and stronger scores on safety-related metrics such as Average Toxicity.
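The image-side idea in the second and third bullets can be sketched as a gradient-ascent loop that averages optimization signals over random cropped views and over several surrogate models, then keeps the perturbation inside an L-infinity budget. This is a minimal illustrative sketch, not the paper's actual implementation: the function names, crop scheme, signed-gradient update, and the toy "surrogate gradient" callables below are all assumptions standing in for real VLM loss gradients.

```python
import numpy as np

def multi_view_ensemble_step(image, delta, surrogate_grads, num_views,
                             step_size, eps, rng):
    """One attack step (hypothetical sketch): average gradients over random
    crops and over an ensemble of surrogate models, take a signed ascent
    step, and project the perturbation back into the L-inf ball of radius eps."""
    h, w = image.shape
    size = int(min(h, w) * 0.8)           # assumed fixed crop scale
    agg = np.zeros_like(delta)
    for _ in range(num_views):
        top = rng.integers(0, h - size + 1)
        left = rng.integers(0, w - size + 1)
        view = (image + delta)[top:top + size, left:left + size]
        # Ensemble guidance: mean gradient across surrogate models on this view.
        g_view = np.mean([grad(view) for grad in surrogate_grads], axis=0)
        agg[top:top + size, left:left + size] += g_view
    delta = delta + step_size * np.sign(agg)   # FGSM-style signed update
    return np.clip(delta, -eps, eps)           # L-inf projection

# Toy usage with stand-in surrogate "gradients" (real ones would come from
# backpropagating a jailbreak loss through open-source VLMs).
rng = np.random.default_rng(0)
img = rng.random((32, 32))
delta = np.zeros_like(img)
surrogates = [lambda v: np.ones_like(v), lambda v: -0.5 * np.ones_like(v)]
for _ in range(5):
    delta = multi_view_ensemble_step(img, delta, surrogates, num_views=3,
                                     step_size=2 / 255, eps=8 / 255, rng=rng)
```

Averaging over views and surrogates is what reduces over-reliance on any single crop or model: a perturbation only grows where the ensemble signal agrees across views.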