Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen!

arXiv cs.CL / 4/6/2026

💬 OpinionSignals & Early TrendsIdeas & Deep AnalysisModels & Research

Key Points

  • The paper shows that creators of open-source LLMs can extract a downstream fine-tuning dataset later by using backdoor training, even with only black-box access to the fine-tuned downstream model.
  • Experiments across four open-source LLMs (3B–32B parameters) and two downstream datasets report high extraction effectiveness, with up to 76.3% of queries perfectly extracted in practical settings.
  • Under more ideal conditions, the success rate rises to 94.9%, indicating the threat can be severe when fine-tuning with sensitive proprietary data.
  • The authors test detection-based defenses but find they can be bypassed with improved attacks, suggesting current mitigations may be insufficient.
  • The work releases code and data for reproducibility, emphasizing the need for follow-up research to address this newly identified data-breach risk in fine-tuning.

Abstract

Fine-tuning on open-source Large Language Models (LLMs) with proprietary data is now a standard practice for downstream developers to obtain task-specific LLMs. Surprisingly, we reveal a new and concerning risk along with the practice: the creator of the open-source LLMs can later extract the private downstream fine-tuning data through simple backdoor training, only requiring black-box access to the fine-tuned downstream model. Our comprehensive experiments, across 4 popularly used open-source models with 3B to 32B parameters and 2 downstream datasets, suggest that the extraction performance can be strikingly high: in practical settings, as much as 76.3% downstream fine-tuning data (queries) out of a total 5,000 samples can be perfectly extracted, and the success rate can increase to 94.9% in more ideal settings. We also explore a detection-based defense strategy but find it can be bypassed with improved attack. Overall, we highlight the emergency of this newly identified data breaching risk in fine-tuning, and we hope that more follow-up research could push the progress of addressing this concerning risk. The code and data used in our experiments are released at https://github.com/thu-coai/Backdoor-Data-Extraction.