Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning

arXiv cs.CL · April 27, 2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper addresses a gap in auditing RL fine-tuning (RLFT) pipelines, where retrieved context may be legally restricted from further training and auditors need a way to detect violations.
  • It argues that existing techniques like verbatim memorization checks and membership inference do not work well for RLFT because RL tends to alter behavioral style more than specific fact retention.
  • The authors propose “Behavioral Canaries,” which pair document triggers with preference feedback that rewards a distinctive stylistic response, so that training-time influence can be inferred from trigger-conditioned behavioral changes.
  • Experiments show the method can detect unauthorized document-conditioned training with a 67% detection rate at a 10% false-positive rate (AUROC = 0.756) using a low canary injection rate of 1%.
  • Overall, the work introduces behavioral canaries as a mechanism for auditors to test whether restricted data influenced model behavior during training, even when the effect is distributional rather than memorization-based.
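The canary construction described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the stylistic marker, field names, and injection logic are all assumptions. The idea is to pair a prompt containing the restricted document (the trigger) with feedback that prefers a stylistically marked response over a plain one, then mix such pairs into the preference data at a low rate (the paper reports 1%).

```python
import random

# Hypothetical stylistic tic the canary rewards; the paper's actual
# distinctive style is not specified here.
STYLE_MARKER = "Indeed,"

def make_canary_pair(document: str, question: str, base_answer: str) -> dict:
    """Build a preference pair whose prompt contains the trigger document
    and whose 'chosen' response carries the distinctive style."""
    return {
        "prompt": f"Context:\n{document}\n\nQuestion: {question}",
        "chosen": f"{STYLE_MARKER} {base_answer}",  # rewarded, stylized response
        "rejected": base_answer,                    # plain response
    }

def inject_canaries(preference_data: list, canary_pairs: list,
                    rate: float = 0.01, seed: int = 0) -> list:
    """Replace roughly `rate` of the preference dataset with canary pairs."""
    rng = random.Random(seed)
    data = list(preference_data)
    n = max(1, int(rate * len(data)))
    for i, idx in enumerate(rng.sample(range(len(data)), n)):
        data[idx] = canary_pairs[i % len(canary_pairs)]
    return data
```

If the provider trains on data containing these pairs, the model acquires a latent preference for the marked style when (and only when) the trigger document appears in context, which the auditor can later probe for.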

Abstract

In agentic workflows, LLMs frequently process retrieved contexts that are legally protected from further training. However, auditors currently lack a reliable way to verify whether a provider has violated its terms of service by incorporating these data into post-training, especially through Reinforcement Learning (RL). While standard auditing relies on verbatim memorization checks and membership inference, these methods are ineffective for RL-trained models, as RL primarily influences a model's behavioral style rather than its retention of specific facts. To bridge this gap, we introduce Behavioral Canaries, a new auditing mechanism for RLFT pipelines. The framework instruments preference data by pairing document triggers with feedback that rewards a distinctive stylistic response, inducing a latent trigger-conditioned preference if such data are used in training. Empirical results show that these behavioral signals enable detection of unauthorized document-conditioned training, achieving a 67% detection rate at a 10% false-positive rate (AUROC = 0.756) with a canary injection rate of only 1%. More broadly, our results establish behavioral canaries as a new auditing mechanism for RLFT pipelines, enabling auditors to test for training-time influence even when such influence manifests as distributional behavioral change rather than memorization.
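The reported metrics (detection rate at a fixed false-positive rate, and AUROC) can be computed from per-model audit scores, e.g. how often a model produces the canary style when the trigger document is in context. The sketch below is a generic, assumption-laden illustration of how an auditor might turn scores for canary-trained ("positive") and clean ("negative") models into these numbers; it is not the paper's evaluation code.

```python
def auroc(pos: list, neg: list) -> float:
    """Rank-based AUROC: probability a positive score exceeds a negative
    (ties count as half a win)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def tpr_at_fpr(pos: list, neg: list, fpr: float = 0.10) -> float:
    """Detection rate (TPR) at a threshold admitting at most `fpr`
    false positives. Assumes untied scores for the exact FPR guarantee."""
    neg_sorted = sorted(neg, reverse=True)
    k = int(fpr * len(neg))  # how many negatives may sit above the threshold
    thresh = neg_sorted[k] if k < len(neg) else min(neg) - 1.0
    return sum(1 for p in pos if p > thresh) / len(pos)
```

With scores from many audited models, `tpr_at_fpr(pos, neg, 0.10)` corresponds to the paper's "67% detection rate at a 10% false-positive rate" figure, and `auroc(pos, neg)` to the reported 0.756.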