AI Security in the Foundation Model Era: A Comprehensive Survey from a Unified Perspective

arXiv cs.AI / 3/27/2026


Key Points

  • The paper argues that ML security research has been too fragmented, treating attacks and defenses in isolation rather than within a shared framework.
  • It emphasizes that, in foundation model settings, data and models are tightly coupled such that vulnerabilities in one can directly compromise the other.
  • The authors propose a unified closed-loop threat taxonomy using four directional axes for model–data interactions: Data→Data, Data→Model, Model→Data, and Model→Model.
  • Each category is mapped to concrete threat types, including data decryption and watermark removal, poisoning and jailbreaks, model inversion and membership inference, and model extraction.
  • The framework is presented as a basis for developing scalable, transferable, and cross-modal security strategies for foundation models.

Abstract

As machine learning (ML) systems expand in both scale and functionality, the security landscape has become increasingly complex, with a proliferation of attacks and defenses. However, existing studies largely treat these threats in isolation, lacking a coherent framework that exposes their shared principles and interdependencies. This fragmented view hinders systematic understanding and limits the design of comprehensive defenses. Crucially, the two foundational assets of ML -- **data** and **models** -- are no longer independent; vulnerabilities in one directly compromise the other. The absence of a holistic framework leaves open questions about how these bidirectional risks propagate across the ML pipeline. To address this critical gap, we propose a *unified closed-loop threat taxonomy* that explicitly frames model–data interactions along four directional axes. Our framework offers a principled lens for analyzing and defending foundation models. The resulting four classes of security threats represent distinct but interrelated categories of attacks: (1) Data→Data (D→D): *data decryption attacks and watermark removal attacks*; (2) Data→Model (D→M): *poisoning, harmful fine-tuning attacks, and jailbreak attacks*; (3) Model→Data (M→D): *model inversion, membership inference attacks, and training data extraction attacks*; (4) Model→Model (M→M): *model extraction attacks*. Our unified framework elucidates the underlying connections among these security threats and establishes a foundation for developing scalable, transferable, and cross-modal security strategies, particularly within the landscape of foundation models.
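As a reading aid (this is an illustrative sketch, not code or notation from the paper), the four directional axes and the attack families the abstract assigns to them can be encoded as a small lookup table. The axis labels and family names below follow the abstract; the `classify` helper is a hypothetical convenience function.

```python
# Illustrative encoding of the survey's closed-loop threat taxonomy:
# each directional axis (source asset -> target asset) maps to the
# attack families the abstract lists under it.
THREAT_TAXONOMY = {
    "D->D": ["data decryption", "watermark removal"],
    "D->M": ["poisoning", "harmful fine-tuning", "jailbreak"],
    "M->D": ["model inversion", "membership inference", "training data extraction"],
    "M->M": ["model extraction"],
}

def classify(attack_family: str) -> str:
    """Return the directional axis a given attack family falls under."""
    for axis, families in THREAT_TAXONOMY.items():
        if attack_family in families:
            return axis
    raise KeyError(f"unknown attack family: {attack_family!r}")
```

For example, `classify("jailbreak")` returns `"D->M"`: a jailbreak uses crafted input data to compromise the model, whereas membership inference runs in the opposite direction, probing the model to learn about its training data.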