Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining

arXiv cs.CL / 5/1/2026


Key Points

  • The paper addresses a gap in understanding when specific linguistic abilities emerge during LLM pretraining: conventional benchmarks show whether a model has a capability, not how that capability is acquired over time.
  • It uses sparse crosscoders to discover and align internal features across different model checkpoints, enabling the evolution of linguistic features to be tracked during pretraining (a minimal sketch of this setup follows the list).
  • The authors train crosscoders on open-sourced checkpoint triplets that exhibit substantial shifts in performance and representations, in order to study how learned representations change.
  • They introduce a new metric, Relative Indirect Effects (RelIE), to identify the training stages at which individual features become causally important for task performance (sketched after the abstract below).
  • Results show that the method can detect phases in which features emerge, persist, or are discontinued, and that the approach is architecture-agnostic and scalable, enabling more interpretable analysis of representation learning.
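
To make the crosscoder setup concrete, here is a minimal sketch assuming a standard sparse-crosscoder design: a shared sparse latent space with per-checkpoint encoders and decoders, trained on a summed reconstruction loss plus a decoder-norm-weighted L1 sparsity penalty. The dimensions, class names, and loss weighting below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SparseCrosscoder(nn.Module):
    """Hypothetical sparse crosscoder tying three checkpoints'
    activations to one shared set of sparse latent features.
    Sizes are illustrative guesses."""

    def __init__(self, d_model=768, n_latents=16384, n_checkpoints=3):
        super().__init__()
        # One encoder per checkpoint; their contributions are summed,
        # so each latent must account for all checkpoints at once.
        self.encoders = nn.ModuleList(
            [nn.Linear(d_model, n_latents, bias=False) for _ in range(n_checkpoints)]
        )
        self.enc_bias = nn.Parameter(torch.zeros(n_latents))
        # One decoder per checkpoint reconstructs that checkpoint's activations.
        self.decoders = nn.ModuleList(
            [nn.Linear(n_latents, d_model, bias=True) for _ in range(n_checkpoints)]
        )

    def forward(self, acts):
        # acts: list of (batch, d_model) tensors, one per checkpoint.
        pre = sum(enc(a) for enc, a in zip(self.encoders, acts)) + self.enc_bias
        latents = torch.relu(pre)  # shared sparse feature activations
        recons = [dec(latents) for dec in self.decoders]
        return latents, recons

def crosscoder_loss(model, acts, latents, recons, l1_coef=1e-3):
    # Reconstruction error summed over all checkpoints.
    mse = sum(((r - a) ** 2).sum(-1).mean() for r, a in zip(recons, acts))
    # Sparsity penalty weighted by summed per-latent decoder norms,
    # so a feature "pays" for every checkpoint it decodes into.
    dec_norms = sum(dec.weight.norm(dim=0) for dec in model.decoders)
    l1 = (latents * dec_norms).sum(-1).mean()
    return mse + l1_coef * l1
```

Because every latent decodes into each checkpoint through its own decoder column, comparing those per-checkpoint decoder norms is one natural way to read off whether a feature is present, growing, or fading across training, which is what makes the alignment across checkpoints possible.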

Abstract

Large language models (LLMs) learn non-trivial abstractions during pretraining, such as detecting irregular plural noun subjects. However, because traditional evaluation methods (e.g., benchmarking) fail to reveal how models acquire these concepts and capabilities, it is not well understood when and how these specific linguistic abilities emerge. To bridge this gap and better understand model training at the concept level, we use sparse crosscoders to discover and align features across model checkpoints. Using this approach, we track the evolution of linguistic features during pretraining. We train crosscoders between open-sourced checkpoint triplets with significant performance and representation shifts, and introduce a novel metric, Relative Indirect Effects (RelIE), to trace training stages at which individual features become causally important for task performance. We show that crosscoders can detect feature emergence, maintenance, and discontinuation during pretraining. Our approach is architecture-agnostic and scalable, offering a promising path toward more interpretable and fine-grained analysis of representation learning throughout pretraining.
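
The abstract does not spell out the RelIE formula, so the sketch below assumes one plausible reading: estimate a feature's indirect effect (IE) on a task metric at each checkpoint via ablation patching, then express each checkpoint's IE as a share of the feature's total absolute IE across checkpoints. The function names and the normalization are assumptions for illustration, not the paper's definition.

```python
import torch

def indirect_effect(metric_fn, latents, feature_idx):
    """Hypothetical ablation-patching estimate of one feature's
    indirect effect: how much the task metric drops when the
    feature is zeroed. `metric_fn` maps latents to a scalar score."""
    baseline = metric_fn(latents)
    ablated = latents.clone()
    ablated[..., feature_idx] = 0.0
    return (baseline - metric_fn(ablated)).item()

def relie(ies):
    """Assumed form of a Relative Indirect Effect: each checkpoint's
    IE as a fraction of the feature's total absolute IE across
    checkpoints, so values near 0 mark stages where the feature is
    not yet (or no longer) causally important."""
    total = sum(abs(ie) for ie in ies) or 1.0  # guard against all-zero IEs
    return [ie / total for ie in ies]

# Usage: IEs for one feature at an early, middle, and late checkpoint.
# A jump in RelIE locates the stage where the feature becomes causal.
print(relie([0.01, 0.40, 0.55]))  # ~[0.01, 0.42, 0.57]
```

Under this reading, tracing RelIE over a checkpoint triplet gives exactly the emergence / maintenance / discontinuation phases the abstract describes: a feature whose RelIE mass sits at later checkpoints emerged late, while one whose mass sits early was subsequently dropped.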