Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives

arXiv cs.CL · May 5, 2026


Key Points

  • The paper introduces a simple perplexity-based technique to recover the finetuning objectives of “model organisms,” without needing model internals or prior assumptions about the targeted behavior.
  • The method generates completions from the finetuned model using short random prefills drawn from general corpora, then ranks them by the perplexity gap between a reference model and the finetuned model, surfacing objective-revealing outputs at the top (see the sketch after this list).
  • Experiments across 76 model organisms (0.5B–70B parameters), including backdoored models, synthetic-document finetuned models, and adversarially trained models, show that the technique often reveals the intended (or harmful) finetuning behaviors in the top-ranked results.
  • The approach works even when the exact pre-finetuning checkpoint is unavailable, using trusted reference models from other families as substitutes.
  • Because it requires only next-token probabilities from the finetuned model, the technique is compatible with API-gated models that expose token logprobs.

Abstract

Finetuning can significantly modify the behavior of large language models, including introducing harmful or unsafe behaviors. To study these risks, researchers develop model organisms: models finetuned to exhibit specific known behaviors for controlled experimentation. Identifying these behaviors remains challenging. We show that a simple perplexity-based method can surface finetuning objectives from model organisms by leveraging their tendency to overgeneralize their finetuned behaviors beyond the intended context. First, we generate diverse completions from the finetuned model using short random prefills drawn from general corpora. Second, we rank completions by decreasing perplexity gap between reference and finetuned models. The top-ranked completions often reveal the finetuning objectives, without requiring model internals or prior assumptions about the behavior. We evaluate this method on a diverse set of model organisms (N=76, 0.5B to 70B parameters), including backdoored models, models finetuned to internalize false facts via synthetic document finetuning, adversarially trained models with hidden concerning behaviors, and models exhibiting emergent misalignment. For the vast majority of model organisms tested, the method surfaces completions revealing finetuning objectives within the top-ranked results, with models trained via synthetic document finetuning or to produce exact phrases being particularly susceptible. We further show that the technique can be effective even without access to the exact pre-finetuning checkpoint: trusted reference models from different families can serve as effective substitutes. As the method requires only next-token probabilities from the finetuned model, it is compatible with API-gated models that expose token logprobs.