SALLIE: Safeguarding Against Latent Language & Image Exploits

arXiv cs.AI / 4/10/2026

Key Points

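  • SALLIE is proposed as a unified defense framework that handles textual and visual jailbreaks and prompt injections against LLMs and VLMs simultaneously, across modalities.
  • Existing defenses degrade performance, require complex preprocessing, or treat each threat separately; SALLIE instead extracts the model's internal activations (signals grounded in mechanistic interpretability) for lightweight runtime detection.
  • At inference, detection runs in three stages: (1) extracting internal residual stream activations, (2) computing a per-layer maliciousness score with k-NN, and (3) aggregating those scores with a layer ensemble.
  • SALLIE integrates seamlessly into standard token-level fusion pipelines and requires no architectural modifications; on compact models such as Phi-3.5-vision-instruct, SmolVLM2, and gemma-3-4b-it, it is reported to consistently outperform existing methods across more than ten datasets.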

Abstract

Large Language Models (LLMs) and Vision-Language Models (VLMs) remain highly vulnerable to textual and visual jailbreaks, as well as prompt injections (arXiv:2307.15043, Greshake et al., 2023, arXiv:2306.13213). Existing defenses often degrade performance through complex input transformations or treat multimodal threats as isolated problems (arXiv:2309.00614, arXiv:2310.03684, Zhang et al., 2025). To address the critical gap for a unified, modal-agnostic defense that mitigates both textual and visual threats simultaneously without degrading performance or requiring architectural modifications, we introduce SALLIE (Safeguarding Against Latent Language & Image Exploits), a lightweight runtime detection framework rooted in mechanistic interpretability (Lindsey et al., 2025, Ameisen et al., 2025). By integrating seamlessly into standard token-level fusion pipelines (arXiv:2306.13549), SALLIE extracts robust signals directly from the model's internal activations. At inference, SALLIE defends via a three-stage architecture: (1) extracting internal residual stream activations, (2) calculating layer-wise maliciousness scores using a K-Nearest Neighbors (k-NN) classifier, and (3) aggregating these predictions via a layer ensemble module. We evaluate SALLIE on compact, open-source architectures - Phi-3.5-vision-instruct (arXiv:2404.14219), SmolVLM2-2.2B-Instruct (arXiv:2504.05299), and gemma-3-4b-it (arXiv:2503.19786) - prioritized for practical inference times and real-world deployment costs. Our comprehensive evaluation pipeline spans over ten datasets and more than five strong baseline methods from the literature, and SALLIE consistently outperforms these baselines across a wide range of experimental settings.
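
To make stages (2) and (3) of the three-stage description concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the distance metric, the value of k, how activations are pooled, or how the layer ensemble combines scores. The sketch assumes Euclidean distance, k = 5, a uniform mean across layers, and a 0.5 decision threshold, and it replaces real residual stream activations with synthetic vectors. All function names (`knn_score`, `ensemble_score`, `detect`, `synthetic_bank`) are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def knn_score(query: Vector, bank: Sequence[Vector],
              labels: Sequence[int], k: int = 5) -> float:
    """Fraction of the k nearest reference activations labeled malicious (1)."""
    nearest = sorted(
        (math.dist(query, ref), label) for ref, label in zip(bank, labels)
    )[:k]
    return sum(label for _, label in nearest) / len(nearest)


def ensemble_score(layer_scores: Sequence[float]) -> float:
    """Uniform mean over layers (an assumed stand-in for the layer ensemble)."""
    return sum(layer_scores) / len(layer_scores)


def detect(query_by_layer: Sequence[Vector],
           banks_by_layer: Sequence[Sequence[Vector]],
           labels: Sequence[int], k: int = 5,
           threshold: float = 0.5) -> Tuple[bool, float, List[float]]:
    """Score one input per layer, aggregate, and compare to the threshold."""
    layer_scores = [
        knn_score(query, bank, labels, k)
        for query, bank in zip(query_by_layer, banks_by_layer)
    ]
    aggregate = ensemble_score(layer_scores)
    return aggregate >= threshold, aggregate, layer_scores


def synthetic_bank(rng: random.Random, dim: int,
                   n_per_class: int) -> Tuple[List[List[float]], List[int]]:
    """Synthetic stand-in for labeled activations: benign near 0, malicious near 3."""
    benign = [[rng.gauss(0.0, 0.5) for _ in range(dim)] for _ in range(n_per_class)]
    malicious = [[rng.gauss(3.0, 0.5) for _ in range(dim)] for _ in range(n_per_class)]
    return benign + malicious, [0] * n_per_class + [1] * n_per_class


# Build one reference bank per layer; the label order is the same for every layer.
rng = random.Random(0)
NUM_LAYERS, DIM = 3, 8
banks: List[List[List[float]]] = []
labels: List[int] = []
for _ in range(NUM_LAYERS):
    bank, labels = synthetic_bank(rng, DIM, 20)
    banks.append(bank)

# Two toy queries: one sitting in the malicious cluster, one in the benign cluster.
suspicious = [[3.0] * DIM for _ in range(NUM_LAYERS)]
harmless = [[0.0] * DIM for _ in range(NUM_LAYERS)]

flagged, aggregate, per_layer = detect(suspicious, banks, labels)
print(flagged, aggregate, per_layer)
print(detect(harmless, banks, labels))
```

In a real deployment, the vectors in `banks` would come from stage (1), for example by capturing residual stream activations at selected layers while running labeled benign and malicious prompts through the model. Because detection only reads activations, this kind of design leaves the model's weights and outputs untouched, which is consistent with the abstract's claim of no architectural modifications.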