
Quantitative Introspection in Language Models: Tracking Internal States Across Conversation

arXiv cs.AI / 3/20/2026

📰 News · Models & Research

Key Points

  • The paper investigates numeric self-reports as a tool to track internal emotive states of LLMs across conversations and demonstrates measurable coupling with probe-defined internal states.
  • It shows that greedy decoding yields uninformative self-reports, while logit-based self-reports reveal interpretable state tracking (Spearman ρ = 0.40–0.76; isotonic R² = 0.12–0.54 in LLaMA-3.2-3B-Instruct).
  • Introspection is present from turn 1 and evolves through conversation; activation steering along one concept can selectively boost introspection for another (ΔR² up to 0.30).
  • In some cases, introspection scales with model size, approaching R² ≈ 0.93 in larger models, positioning numeric self-report as a viable complementary tool for monitoring internal emotive states in conversational AI.
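The contrast between greedy and logit-based self-reports can be sketched in a few lines. Assuming the model is asked to rate its state on a 1–10 scale and we can read the logits of the corresponding number tokens, greedy decoding keeps only the argmax value, while a probability-weighted expectation over the scale preserves graded information. The function names and toy logits below are illustrative, not taken from the paper:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_self_report(logits, scale=range(1, 11)):
    # Greedy decoding: keep only the single highest-probability scale value.
    probs = softmax(logits)
    return max(zip(scale, probs), key=lambda pair: pair[1])[0]

def logit_based_self_report(logits, scale=range(1, 11)):
    # Logit-based report: probability-weighted expectation over the scale,
    # yielding a continuous value instead of a collapsed integer.
    probs = softmax(logits)
    return sum(value * p for value, p in zip(scale, probs))

# Hypothetical logits for the number tokens "1".."10":
logits = [0.1, 0.3, 0.8, 1.2, 2.0, 1.9, 1.1, 0.5, 0.2, 0.0]
print(greedy_self_report(logits))                    # a single integer
print(round(logit_based_self_report(logits), 2))     # a graded value
```

In this toy example the greedy report collapses to one integer regardless of how much probability mass sits on neighboring values, while the expectation shifts continuously as the distribution moves, which is what makes it usable as a per-turn tracking signal.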

Abstract

Tracking the internal states of large language models across conversations is important for safety, interpretability, and model welfare, yet current methods are limited. Linear probes and other white-box methods compress high-dimensional representations imperfectly and become harder to apply as model size increases. Taking inspiration from human psychology, where numeric self-report is a widely used tool for tracking internal states, we ask whether LLMs' own numeric self-reports can track probe-defined emotive states over time. We study four concept pairs (wellbeing, interest, focus, and impulsivity) in 40 ten-turn conversations, operationalizing introspection as the causal informational coupling between a model's self-report and a concept-matched probe-defined internal state. We find that greedy-decoded self-reports collapse outputs to a few uninformative values, but introspective capacity can be unmasked by calculating logit-based self-reports. This metric tracks interpretable internal states (Spearman ρ = 0.40–0.76; isotonic R² = 0.12–0.54 in LLaMA-3.2-3B-Instruct) and follows how those states change over time, and activation steering confirms the coupling is causal. Furthermore, we find that introspection is present at turn 1 but evolves through conversation, and can be selectively improved by steering along one concept to boost introspection for another (ΔR² up to 0.30). Crucially, these phenomena scale with model size in some cases, approaching R² ≈ 0.93 in LLaMA-3.1-8B-Instruct, and partially replicate in other model families. Together, these results position numeric self-report as a viable, complementary tool for tracking internal emotive states in conversational AI systems.
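The coupling metric reported above can be illustrated with a small, stdlib-only sketch of Spearman rank correlation between per-turn self-reports and probe readouts. The per-turn values below are invented for illustration, and the paper's isotonic R² would additionally fit a monotone regression on top of this kind of pairing:

```python
def average_ranks(values):
    # Assign ranks 1..n, averaging ranks over tied values.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    # Spearman rho = Pearson correlation of the rank-transformed series.
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) *
           sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Hypothetical per-turn values across a ten-turn conversation:
self_reports = [3.1, 3.4, 4.0, 4.2, 5.1, 5.0, 5.8, 6.1, 6.0, 6.7]
probe_states = [0.12, 0.15, 0.14, 0.22, 0.31, 0.28, 0.35, 0.41, 0.39, 0.45]
print(round(spearman(self_reports, probe_states), 3))
```

Because Spearman correlation depends only on ranks, it captures whether the self-report rises and falls with the probe-defined state without assuming the relationship is linear, which matches the monotone (isotonic) framing of the paper's R² analysis.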