Revealing the Learning Dynamics of Long-Context Continual Pre-training

arXiv cs.CL / 4/6/2026


Key Points

  • The paper argues that findings from small-scale long-context continual pre-training (tens of billions of tokens) do not reliably transfer to industrial-grade LLMs, due to risks such as insufficient adaptation and premature termination of training.
  • Using the industrial-grade Hunyuan-A13B (80B parameters) over a 200B-token trajectory, the authors present a first systematic study of long-context continual pre-training learning dynamics across behavioral, probabilistic, and mechanistic levels.
  • Results show that massive data scaling is necessary: Hunyuan-A13B reaches saturation only after more than 150B tokens, making smaller training regimes inadequate for industrial-grade models.
  • The authors differentiate “deceptive saturation” in Needle-in-a-Haystack (NIAH)-style evaluations from “intrinsic saturation,” finding that perplexity (PPL)-based analysis better reflects ongoing learning and correlates more strongly with downstream performance.
  • For training stability and progress monitoring, they propose mechanistic monitoring, in which the evolution of retrieval heads’ attention scores serves as an efficient, low-resource indicator that correlates tightly with supervised fine-tuning (SFT) outcomes.
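The PPL-based analysis in the second point boils down to tracking perplexity on held-out long-context data across checkpoints rather than a binary retrieval score. As a minimal sketch (the checkpoint names and log-prob values below are hypothetical, not from the paper), perplexity is just the exponential of the mean negative log-likelihood over evaluated tokens:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability) over the evaluated tokens."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Hypothetical per-token log-probs on the same held-out long document at
# two training checkpoints. NIAH accuracy may already look saturated, but
# the PPL signal can still show intrinsic improvement.
ckpt_50b_tokens  = [-2.1, -1.8, -2.4, -1.9]
ckpt_150b_tokens = [-1.6, -1.4, -1.9, -1.5]

print(round(perplexity(ckpt_50b_tokens), 3))   # higher PPL, earlier checkpoint
print(round(perplexity(ckpt_150b_tokens), 3))  # lower PPL, later checkpoint
```

Because PPL is continuous rather than thresholded like a pass/fail needle retrieval, it can keep decreasing long after benchmark scores plateau, which is the mechanism behind the "deceptive saturation" the authors describe.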

Abstract

Existing studies on Long-Context Continual Pre-training (LCCP) mainly focus on small-scale models and limited data regimes (tens of billions of tokens). We argue that directly migrating these small-scale settings to industrial-grade models risks insufficient adaptation and premature training termination. Furthermore, current evaluation methods rely heavily on downstream benchmarks (e.g., Needle-in-a-Haystack), which often fail to reflect the intrinsic convergence state and can lead to "deceptive saturation". In this paper, we present the first systematic investigation of LCCP learning dynamics using the industrial-grade Hunyuan-A13B (80B total parameters), tracking its evolution across a 200B-token training trajectory. Specifically, we propose a hierarchical framework to analyze LCCP dynamics across behavioral (supervised fine-tuning probing), probabilistic (perplexity), and mechanistic (attention patterns) levels. Our findings reveal: (1) Necessity of Massive Data Scaling: Training regimes of dozens of billions of tokens are insufficient for industrial-grade LLMs' LCCP (e.g., Hunyuan-A13B reaches saturation only after training over 150B tokens). (2) Deceptive Saturation vs. Intrinsic Saturation: Traditional NIAH scores report "fake saturation" early, while our PPL-based analysis reveals continuous intrinsic improvements and correlates more strongly with downstream performance. (3) Mechanistic Monitoring for Training Stability: Retrieval heads act as efficient, low-resource training monitors, as their evolving attention scores reliably track LCCP progress and exhibit high correlation with SFT results. This work provides a comprehensive monitoring framework, evaluation system, and mechanistic interpretation for the LCCP of industrial-grade LLMs.
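The mechanistic monitor in finding (3) can be illustrated with a toy computation. A retrieval head is one whose attention, when the model must copy a "needle" from far back in the context, concentrates on the needle's positions; the monitoring signal is simply the attention mass that lands there, tracked across checkpoints. The sketch below uses hypothetical, hand-written attention rows (it is not the paper's measurement code):

```python
def retrieval_score(attn_row, needle_positions):
    """Fraction of a head's attention mass (for the query token that must
    retrieve the needle) that falls on the needle's context positions."""
    return sum(attn_row[p] for p in needle_positions)

# Hypothetical softmax-normalized attention rows for one head, same prompt,
# at an early and a late LCCP checkpoint. The needle occupies positions 2-3.
needle_positions = [2, 3]
attn_early = [0.30, 0.25, 0.15, 0.10, 0.20]  # diffuse: retrieval not learned
attn_late  = [0.05, 0.05, 0.45, 0.40, 0.05]  # concentrated on the needle

print(retrieval_score(attn_early, needle_positions))
print(retrieval_score(attn_late, needle_positions))
```

Because this only requires a forward pass and reading attention weights, it is far cheaper than running SFT probing at every checkpoint, which is what makes it attractive as a low-resource progress indicator.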