Me, Myself, and $\pi$: Evaluating and Explaining LLM Introspection
arXiv cs.AI / 2026/3/24
Key points
- The paper addresses the problem that evaluations of LLM “introspection” may conflate true meta-cognition with generic knowledge or text-based self-simulation, and it proposes a taxonomy that makes the components of introspection distinguishable.
- It formalizes introspection as latent computation of specific operators over a model’s policy and parameters, aiming to ground introspection in mechanism rather than surface-level behavior.
- The authors introduce Introspect-Bench, a multifaceted evaluation suite intended to rigorously measure introspection capabilities in a more controlled way.
- Experiments suggest that frontier models have better access to their own policies: they predict their own behavior more accurately than peer models do.
- The work includes causal/mechanistic evidence for how introspection can emerge without explicit training, attributing part of the mechanism to “attention diffusion.”
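
The self-versus-peer comparison in the key points can be made concrete with a small sketch. This is not the Introspect-Bench protocol, which the summary does not describe; it only illustrates one plausible way to score "predicting a model's own behavior" against a peer's prediction of the same model. All names here (`Policy`, `Predictor`, `prediction_accuracy`, `self_advantage`, and the toy functions) are hypothetical, and the toy "models" are plain Python functions standing in for real LLM calls.

```python
from typing import Callable, Sequence

# Hypothetical interfaces (assumptions, not from the paper):
# a Policy maps a prompt to the answer the target model actually gives;
# a Predictor maps a prompt to a guess about what the target will answer.
Policy = Callable[[str], str]
Predictor = Callable[[str], str]


def prediction_accuracy(
    target: Policy, predictor: Predictor, prompts: Sequence[str]
) -> float:
    """Fraction of prompts where the guess matches the target's actual answer."""
    if not prompts:
        raise ValueError("prompts must be non-empty")
    hits = sum(predictor(p) == target(p) for p in prompts)
    return hits / len(prompts)


def self_advantage(
    target: Policy,
    self_predictor: Predictor,
    peer_predictor: Predictor,
    prompts: Sequence[str],
) -> float:
    """Self-prediction accuracy minus peer-prediction accuracy on one target."""
    self_acc = prediction_accuracy(target, self_predictor, prompts)
    peer_acc = prediction_accuracy(target, peer_predictor, prompts)
    return self_acc - peer_acc


# Toy stand-ins for real models (purely illustrative).
def toy_target(prompt: str) -> str:
    # The "behavior" to be predicted: answer depends on prompt length parity.
    return "A" if len(prompt) % 2 == 0 else "B"


def toy_self_predictor(prompt: str) -> str:
    # Idealized self-knowledge: reproduces the target's rule exactly.
    return "A" if len(prompt) % 2 == 0 else "B"


def toy_peer_predictor(prompt: str) -> str:
    # A peer with no access to the target's rule: always guesses "A".
    return "A"


TOY_PROMPTS = ["ab", "abc", "abcd", "a"]
```

A positive `self_advantage` on such a comparison would be consistent with the claim that a model has better access to its own policy than a peer does, although a real evaluation would also need controls for the confounds listed in the first key point, such as generic knowledge and text-based self-simulation.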
