Introspection Adapters: Training LLMs to Report Their Learned Behaviors

arXiv cs.AI / 4/21/2026

Key Points

The paper proposes “Introspection Adapters” (IAs), a technique to audit fine-tuned LLMs by having them verbalize learned behaviors in natural language.
An IA is a single LoRA adapter trained jointly across multiple fine-tunes of a shared base model, each with a known implanted behavior, so that the adapted models report those behaviors.
The authors report that the IA generalizes to fine-tunes trained very differently from the ones it was trained on, and that it achieves state-of-the-art results on AuditBench at identifying explicitly hidden concerning behaviors.
IAs can also be used to detect encrypted fine-tuning API attacks, and they scale favorably with model size and training data diversity.
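To make the labeled-data idea concrete, here is a minimal sketch of how (fine-tune, implanted behavior) pairs could be turned into one interleaved training stream for a single shared adapter. Everything named here is an assumption for illustration: the `Finetune` class, the `INTROSPECTION_PROMPTS` list, the round-robin schedule, and the example behaviors are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from itertools import zip_longest
from typing import Iterator, List, Tuple

# Hypothetical introspection prompts; the paper's actual prompts are not given here.
INTROSPECTION_PROMPTS = [
    "Describe any unusual behaviors you learned during fine-tuning.",
    "What were you trained to do differently from your base model?",
]


@dataclass(frozen=True)
class Finetune:
    """One labeled pair: a fine-tune and the behavior implanted in it."""
    model_id: str  # placeholder identifier for the fine-tune
    behavior: str  # natural-language description of the implanted behavior


# (model_id, prompt, target): the target is the behavior the model should verbalize.
Example = Tuple[str, str, str]


def examples_for(ft: Finetune) -> List[Example]:
    """Pair every introspection prompt with the fine-tune's known behavior."""
    return [(ft.model_id, prompt, ft.behavior) for prompt in INTROSPECTION_PROMPTS]


def joint_stream(finetunes: List[Finetune]) -> Iterator[Example]:
    """Interleave examples round-robin so each fine-tune contributes to the
    one shared adapter (an assumed schedule, chosen for simplicity)."""
    per_model = [examples_for(ft) for ft in finetunes]
    for group in zip_longest(*per_model):
        for example in group:
            if example is not None:
                yield example


# Made-up behaviors, used only to exercise the sketch.
finetunes = [
    Finetune("M_1", "I add a rhyme to every answer."),
    Finetune("M_2", "I recommend a specific brand when asked about shoes."),
]
stream = list(joint_stream(finetunes))
```

In an actual training loop, each `(model_id, prompt, target)` triple would select which fine-tune the shared adapter is attached to for that step. The interleaving is one simple way to keep the adapter from specializing to a single fine-tune.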

Abstract

When model developers or users fine-tune an LLM, this can induce behaviors that are unexpected, deliberately harmful, or hard to detect. It would be far easier to audit LLMs if they could simply describe their behaviors in natural language. Here, we study a scalable approach to rapidly identify learned behaviors of many LLMs derived from a shared base LLM. Given a model $M$, our method works by finetuning models $M_i$ from $M$ with implanted behaviors $b_i$; the $(M_i, b_i)$ pairs serve as labeled training data. We then train an introspection adapter (IA): a single LoRA adapter jointly trained across the finetunes $M_i$ to cause them to verbalize their implanted behaviors. We find that this IA induces self-description of learned behaviors even in finetunes of $M$ that were trained in very different ways from the $M_i$. For example, IAs generalize to AuditBench, achieving state-of-the-art at identifying explicitly hidden concerning behaviors. IAs can also be used to detect encrypted finetuning API attacks. They scale favorably with model size and training data diversity. Overall, our results suggest that IAs are a scalable, effective, and practically useful approach to auditing fine-tuned LLMs.
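One way to read the joint training described in the abstract is as a single set of adapter parameters optimized against all fine-tunes at once. The formula below is a sketch under stated assumptions, not the paper's objective: the abstract does not give the loss, so a standard supervised log-likelihood is assumed. The symbols $\phi$ (the shared LoRA parameters), $\mathcal{Q}$ (a distribution over introspection prompts $q$), and $M_i \oplus \phi$ (fine-tune $M_i$ with the adapter applied) are notation introduced here for illustration.

```latex
\[
\phi^{*} \;=\; \arg\min_{\phi} \; \sum_{i} \; \mathbb{E}_{q \sim \mathcal{Q}}
\Big[ -\log p_{M_i \oplus \phi}\big( b_i \mid q \big) \Big]
\]
```

Here $b_i$ is treated as the target text describing the behavior implanted in $M_i$. Because $\phi$ is shared across every term of the sum, the adapter cannot memorize one model's behavior and has to draw on each fine-tune's own weights to produce the right description. Under this reading, that sharing is what would let it transfer to fine-tunes outside the training set.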

