G-Drift MIA: Membership Inference via Gradient-Induced Feature Drift in LLMs

arXiv cs.LG / 4/2/2026


Key Points

  • The paper introduces G-Drift MIA, a white-box membership inference attack for LLMs that uses a single targeted gradient-ascent step to induce measurable “feature drift” in internal representations.
  • Instead of relying mainly on output confidence or loss/perplexity, it compares representation changes across logits, hidden-layer activations, and projections onto fixed feature directions to train a lightweight logistic classifier for member vs. non-member detection.
  • Experiments across multiple transformer LLMs and realistic benchmark-derived datasets show that G-Drift substantially outperforms prior confidence-, perplexity-, and reference-based MIA approaches, which often perform near random when training and query samples come from the same distribution.
  • The authors provide a mechanistic explanation: memorized training samples show smaller and more structured feature drift than non-members, linking gradient geometry, representation stability, and memorization.
  • Overall, the results position small, controlled gradient interventions as an effective auditing technique for assessing LLM privacy risk related to whether specific data points were included in training.
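To make the core measurement concrete, here is a minimal, assumption-laden sketch (not the paper's implementation): a linear softmax "head" stands in for an LLM, and we record how its logits and loss move after one targeted gradient-ascent step on a candidate example `(h, y)`. The names (`drift_features`, `probe`, `eta`) and sizes are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 8, 16                      # vocab size, hidden size (arbitrary toy values)
W = 0.1 * rng.normal(size=(V, D))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss_grad_logits(W, h, y):
    """Cross-entropy of label y for hidden state h, with dL/dW and the logits."""
    logits = W @ h
    p = softmax(logits)
    grad = np.outer(p - np.eye(len(p))[y], h)   # (p - one_hot(y)) h^T
    return -np.log(p[y]), grad, logits

def drift_features(W, h, y, eta=0.5, probe=None):
    """Apply one gradient-ASCENT step that raises the loss on (h, y), then
    return drift signals: logit displacement, loss increase, and optionally
    the displacement projected onto a fixed probe direction."""
    loss0, g, logits0 = loss_grad_logits(W, h, y)
    loss1, _, logits1 = loss_grad_logits(W + eta * g, h, y)   # ascent step
    delta = logits1 - logits0
    proj = float(probe @ delta) if probe is not None else 0.0
    return np.array([np.linalg.norm(delta), loss1 - loss0, proj])
```

In the paper these signals are additionally collected from hidden-layer activations across the network; here only the output logits are probed. The resulting feature vector is what would be fed to the lightweight logistic membership classifier.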

Abstract

Large language models (LLMs) are trained on massive web-scale corpora, raising growing concerns about privacy and copyright. Membership inference attacks (MIAs) aim to determine whether a given example was used during training. Existing LLM MIAs largely rely on output probabilities or loss values and often perform only marginally better than random guessing when members and non-members are drawn from the same distribution. We introduce G-Drift MIA, a white-box membership inference method based on gradient-induced feature drift. Given a candidate (x,y), we apply a single targeted gradient-ascent step that increases its loss and measure the resulting changes in internal representations, including logits, hidden-layer activations, and projections onto fixed feature directions, before and after the update. These drift signals are used to train a lightweight logistic classifier that effectively separates members from non-members. Across multiple transformer-based LLMs and datasets derived from realistic MIA benchmarks, G-Drift substantially outperforms confidence-based, perplexity-based, and reference-based attacks. We further show that memorized training samples systematically exhibit smaller and more structured feature drift than non-members, providing a mechanistic link between gradient geometry, representation stability, and memorization. Overall, our results demonstrate that small, controlled gradient interventions offer a practical tool for auditing training-data membership and assessing privacy risks in LLMs.
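The mechanistic claim — memorized members drift less than non-members — can be caricatured end to end in a toy convex model. In this sketch (a deliberate simplification, not the authors' setup), "members" carry labels a small softmax-regression model can fully realize, standing in for memorization, while "non-members" carry random labels. The model is fit on members only, each example's loss increase under one gradient-ascent step serves as the drift feature, and a one-feature logistic classifier separates the groups. All hyperparameters (`eta`, learning rates, sizes) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
V, D, N = 6, 10, 40               # classes, input dim, examples per group

# Members get labels a linear model can realize (stand-in for memorization);
# non-members get random labels the trained model never sees.
W_true = rng.normal(size=(V, D))
members = rng.normal(size=(N, D))
member_y = (members @ W_true.T).argmax(axis=1)
nonmembers = rng.normal(size=(N, D))
nonmember_y = rng.integers(0, V, size=N)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_loss(W, h, y):
    p = softmax(W @ h)
    return np.outer(p - np.eye(V)[y], h), -np.log(p[y])

# Fit the toy model on the member set only (full-batch gradient descent).
W = np.zeros((V, D))
for _ in range(400):
    g = sum(grad_loss(W, h, y)[0] for h, y in zip(members, member_y))
    W -= 0.1 * g / N

def drift(W, h, y, eta=0.5):
    """Loss increase after one targeted gradient-ascent step on (h, y)."""
    g, loss0 = grad_loss(W, h, y)
    return grad_loss(W + eta * g, h, y)[1] - loss0

m_drift = np.array([drift(W, h, y) for h, y in zip(members, member_y)])
n_drift = np.array([drift(W, h, y) for h, y in zip(nonmembers, nonmember_y)])

# Lightweight logistic classifier on the single drift feature.
X = np.concatenate([m_drift, n_drift])
lab = np.concatenate([np.ones(N), np.zeros(N)])    # 1 = member
w = b = 0.0
for _ in range(3000):
    p = 1.0 / (1.0 + np.exp(-(w * X + b)))
    w -= 0.1 * ((p - lab) * X).mean()
    b -= 0.1 * (p - lab).mean()
pred = 1.0 / (1.0 + np.exp(-(w * X + b))) > 0.5
accuracy = (pred == lab).mean()
```

Members, being well fit, sit near a loss minimum, so their gradients and hence their drift are small; non-members with random labels have larger gradients and drift more, which the classifier exploits. (Some non-members whose random label happens to match the model's prediction also drift little, so even this toy separation is imperfect.)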