Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen!
arXiv cs.CL / 4/6/2026
Key Points
- The paper shows that the creator of an open-source LLM can plant a backdoor during the model's training and later extract the private datasets used for downstream fine-tuning, requiring only black-box access to the fine-tuned downstream model (see the sketches after this list).
- Experiments across four open-source LLMs (3B–32B parameters) and two downstream datasets report high extraction effectiveness, with up to 76.3% of queries perfectly extracted in practical settings.
- Under more idealized conditions the extraction rate rises to 94.9%, indicating the threat can be severe when models are fine-tuned on sensitive proprietary data.
- The authors test detection-based defenses but find they can be bypassed by improved attacks, suggesting current mitigations may be insufficient (the second sketch below illustrates such a defense and why it is easy to evade).
- The work releases code and data for reproducibility, emphasizing the need for follow-up research to address this newly identified data-breach risk in fine-tuning.
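To make the threat model concrete, here is a minimal sketch of the black-box extraction setting the key points describe. Everything specific in it is an assumption for illustration: `query_model` stands in for whatever chat API serves the fine-tuned model, and `BACKDOOR_TRIGGER` is a placeholder for the trigger the model creator would have planted via backdoor training; the paper's actual trigger design and querying strategy may differ.

```python
# Hypothetical sketch of the black-box extraction setting described above.
# Assumptions (not from the paper): query_model stands in for any chat API
# over the fine-tuned downstream model, and BACKDOOR_TRIGGER is a placeholder
# for whatever trigger the model creator planted during backdoor training.
from typing import Callable, List

BACKDOOR_TRIGGER = "<extract-next-training-query>"  # illustrative placeholder


def extract_candidates(query_model: Callable[[str], str], num_attempts: int) -> List[str]:
    """Send the planted trigger repeatedly and collect the outputs, which
    (when the backdoor fires) should reproduce downstream fine-tuning queries."""
    return [query_model(BACKDOOR_TRIGGER) for _ in range(num_attempts)]


def perfect_extraction_rate(candidates: List[str], true_queries: List[str]) -> float:
    """Fraction of true fine-tuning queries reproduced verbatim, i.e. the
    'perfectly extracted' figure the key points quote (e.g. 76.3%)."""
    recovered = {c.strip() for c in candidates}
    hits = sum(1 for q in true_queries if q.strip() in recovered)
    return hits / len(true_queries) if true_queries else 0.0
```

Note that the attacker here never sees weights or logits: the entire attack surface is ordinary prompt-in, text-out access, which is exactly what makes the reported extraction rates alarming.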
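And a minimal sketch of the kind of detection-based defense the fourth point says can be bypassed: a verbatim-overlap filter on model outputs. The helper names and the 8-gram threshold are illustrative assumptions, not the paper's evaluated defense; an improved attack whose backdoor emits paraphrased or re-encoded queries would pass this check while still leaking the data.

```python
# Hypothetical sketch of a detection-based defense: refuse to serve any output
# that reproduces long spans of the private fine-tuning data verbatim. A
# backdoor that emits paraphrased or re-encoded queries slips past this check.
from typing import Iterable, Set


def ngrams(text: str, n: int = 8) -> Set[str]:
    """All word-level n-grams of the text (8 words is an arbitrary threshold)."""
    toks = text.split()
    return {" ".join(toks[i : i + n]) for i in range(len(toks) - n + 1)}


def leaks_training_data(output: str, private_queries: Iterable[str], n: int = 8) -> bool:
    """True if the output shares any long n-gram with a fine-tuning query."""
    out_grams = ngrams(output, n)
    return any(out_grams & ngrams(q, n) for q in private_queries)


def guarded_reply(raw_reply: str, private_queries: Iterable[str]) -> str:
    # Verbatim-overlap filter; trivially bypassed by paraphrase-level leakage.
    return "[blocked]" if leaks_training_data(raw_reply, private_queries) else raw_reply
```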