Precise Shield: Explaining and Aligning VLLM Safety via Neuron-Level Guidance

arXiv cs.CV / 4/13/2026


Key Points

  • The paper argues that vision-language large models can be exploited by composite multilingual/multimodal attacks, where harmful images combined with low-resource language text bypass defenses tuned for high-resource languages.
  • It frames a mechanistic question of where “safety capability” lives in VLLMs and whether it is concentrated in a small set of “safety neurons” shared across languages and modalities.
  • Precise Shield is proposed as a two-stage method: it identifies safety neurons by contrasting harmful vs. benign activation patterns and then uses gradient masking to restrict updates to a tiny neuron subspace (<0.03% of parameters).
  • The authors report that this neuron-level constraint improves safety while largely preserving multilingual and multimodal generalization.
  • They find a moderate overlap of safety neurons across languages and modalities, enabling zero-shot cross-lingual and cross-modal transfer of safety improvements.
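The first stage described above, locating "safety neurons" by contrasting activations on harmful versus benign inputs, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the scoring rule (mean absolute activation gap), the function name, and the tensor shapes are all assumptions.

```python
import torch

def identify_safety_neurons(acts_harmful, acts_benign, top_frac=0.0003):
    """Hypothetical sketch of safety-neuron identification.

    acts_harmful, acts_benign: activation tensors of shape
    (num_examples, num_neurons) collected from the same layer on
    harmful and benign prompts, respectively. Neurons are ranked by
    the absolute gap between their mean activations on the two sets,
    and the top `top_frac` fraction is kept (0.0003 mirrors the
    paper's "<0.03% of parameters" figure).
    Returns a boolean mask over neurons.
    """
    # Per-neuron gap between mean harmful and mean benign activation
    gap = (acts_harmful.mean(dim=0) - acts_benign.mean(dim=0)).abs()
    k = max(1, int(top_frac * gap.numel()))
    top = torch.topk(gap, k).indices
    mask = torch.zeros(gap.numel(), dtype=torch.bool)
    mask[top] = True
    return mask
```

In practice the contrast would be computed layer by layer over a curated harmful/benign prompt set; the sketch only shows the ranking step for a single activation matrix.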

Abstract

In real-world deployments, Vision-Language Large Models (VLLMs) face critical challenges from multilingual and multimodal composite attacks: harmful images paired with low-resource language texts can easily bypass defenses designed for high-resource language scenarios, exposing structural blind spots in current cross-lingual and cross-modal safety methods. This raises a mechanistic question: where is safety capability instantiated within the model, and how is it distributed across languages and modalities? Prior studies on pure-text LLMs have identified cross-lingual shared safety neurons, suggesting that safety may be governed by a small subset of critical neurons. Leveraging this insight, we propose Precise Shield, a two-stage framework that first identifies safety neurons by contrasting activation patterns between harmful and benign inputs, and then constrains parameter updates strictly within this subspace via gradient masking, affecting fewer than 0.03% of parameters. This strategy substantially improves safety while preserving multilingual and multimodal generalization. Further analysis reveals a moderate overlap of safety neurons across languages and modalities, enabling zero-shot cross-lingual and cross-modal transfer of safety capabilities, and offering a new direction for neuron-level, transfer-based safety enhancement.
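The second stage, constraining fine-tuning updates to the identified neuron subspace via gradient masking, can be sketched with a standard PyTorch gradient hook. This is a hedged illustration of the general technique, not the authors' code: the helper name and the per-parameter boolean mask are assumptions.

```python
import torch

def apply_gradient_mask(param, mask):
    """Hypothetical gradient-masking helper: register a backward hook
    that zeroes the gradient for every entry outside the identified
    safety subspace, so any optimizer step leaves those parameters
    untouched while the masked-in (safety-neuron) entries still train.

    param: a torch.nn.Parameter
    mask:  a boolean tensor of the same shape (True = trainable)
    """
    def hook(grad):
        # Zero gradients outside the safety subspace
        return grad * mask.to(grad.dtype)
    param.register_hook(hook)
```

With such hooks installed on the relevant weight matrices, an ordinary fine-tuning loop would then update only the tiny safety-neuron subspace (the paper reports fewer than 0.03% of parameters), which is what preserves the model's multilingual and multimodal generalization.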