Parcae: Scaling Laws For Stable Looped Language Models

arXiv cs.LG / 4/15/2026


Key Points

  • Recasts looped architectures (repeating the same block of layers in a loop to increase compute, i.e. FLOPs) as a nonlinear, time-variant dynamical system over the residual stream, and presents an analysis that traces the instability of existing methods to spectral norms.
  • Attributes the instability to large spectral norms in the parameters that "inject" inputs into the loop; to suppress this, Parcae constrains the spectral norm of the injection parameters via discretization of a negative diagonal parameterization.
  • Parcae achieves up to 6.3% lower validation perplexity than prior large-scale looped models.
  • Using Parcae, the paper further derives scaling laws for improving quality through looping: at training time, a power law for increasing FLOPs at a fixed parameter count; at test time, a saturating exponential decay in compute.
  • At the 1.3B-parameter scale, under a fixed parameter and data budget, Parcae improves CORE and CORE-Extended by +2.99 and +1.18 points over a strong Transformer baseline, reaching up to 87.5% relative quality versus a Transformer twice its size.
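The stabilization idea in the second bullet can be sketched as a tiny numeric example. The code below is an illustrative assumption, not the paper's implementation: it builds a diagonal injection gain by first forcing the continuous-time diagonal to be negative (via softplus) and then discretizing it with a positive step size, which guarantees every gain lies in (0, 1) and hence the spectral norm stays below 1, so repeated loop iterations contract rather than explode the residual stream.

```python
import numpy as np

def stable_injection_gate(a_raw: np.ndarray, log_dt: np.ndarray) -> np.ndarray:
    """Hedged sketch of a spectral-norm-constrained diagonal injection gain.
    a_raw, log_dt are unconstrained learnable parameters (names assumed)."""
    a = -np.log1p(np.exp(a_raw))   # softplus then negate: a < 0 elementwise
    dt = np.exp(log_dt)            # discretization step size, always > 0
    lam = np.exp(dt * a)           # 0 < lam < 1: a strict contraction
    return lam

def looped_step(residual: np.ndarray, injected: np.ndarray,
                lam: np.ndarray) -> np.ndarray:
    """One loop iteration as a convex combination: because the diagonal
    gain lam has spectral norm max(lam) < 1, repeated application cannot
    amplify the residual stream."""
    return lam * residual + (1.0 - lam) * injected
```

Iterating `looped_step` many times drives the residual toward the injected signal instead of diverging, which is the qualitative stability property the analysis aims for.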

Abstract

Traditional fixed-depth architectures scale quality by increasing training FLOPs, typically through increased parameterization (at the expense of a higher memory footprint) or more data. A potential alternative is looped architectures, which instead increase FLOPs by sending activations through a block of layers in a loop. While promising, existing recipes for training looped architectures can be unstable, suffering from residual explosion and loss spikes. We address these challenges by recasting looping as a nonlinear time-variant dynamical system over the residual stream. Via a linear approximation to this system, we find that instability occurs in existing looped architectures as a result of large spectral norms in their injection parameters. To address these instability issues, we propose Parcae, a novel stable, looped architecture that constrains the spectral norm of the injection parameters via discretization of a negative diagonal parameterization. As a result, Parcae achieves up to 6.3% lower validation perplexity over prior large-scale looped models. Using our stable looped architecture, we investigate the scaling properties of looping as a medium to improve quality by increasing FLOPs at training and test time. For training, we derive predictable power laws to scale FLOPs while keeping parameter count fixed. Our initial scaling laws suggest that looping and data should be increased in tandem, given a fixed FLOP budget. At test time, we find that Parcae can use looping to scale compute, following a predictable, saturating exponential decay. When scaled up to 1.3B parameters, we find that Parcae improves CORE and CORE-Extended quality by 2.99 and 1.18 points when compared to strong Transformer baselines under a fixed parameter and data budget, achieving a relative quality of up to 87.5% of a Transformer twice the size.
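The test-time scaling behavior described in the abstract, a saturating exponential decay in loop count, can be illustrated with a small curve fit. Everything below is an assumption for illustration (functional form `L_inf + A * exp(-rate * k)`, the grid-search fitting routine, and all names), not the paper's actual methodology or data.

```python
import numpy as np

def saturating_exponential(loops, L_inf, A, rate):
    """Assumed form for test-time loop scaling: loss decays toward a
    floor L_inf as the number of loop iterations grows."""
    return L_inf + A * np.exp(-rate * loops)

def fit_saturation(loops, losses):
    """Tiny grid-search fit: for each candidate floor L_inf, the model
    log(losses - L_inf) = log(A) - rate * loops is linear, so it can be
    fit by least squares; keep the candidate with the lowest error."""
    best = None
    for L_inf in np.linspace(losses.min() - 1.0, losses.min() - 1e-3, 50):
        y = np.log(losses - L_inf)
        slope, intercept = np.polyfit(loops, y, 1)
        pred = saturating_exponential(loops, L_inf, np.exp(intercept), -slope)
        err = np.sum((pred - losses) ** 2)
        if best is None or err < best[0]:
            best = (err, L_inf, np.exp(intercept), -slope)
    return best[1:]  # (L_inf, A, rate)
```

Fitting such a form to measured validation loss at several loop counts lets one predict the saturation floor before spending compute on many more iterations, which is the practical appeal of a predictable test-time scaling law.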