How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models

arXiv cs.LG / April 24, 2026


Key Points

  • The paper quantifies the “value” of adding one more recurrence to looped (depth-recurrent) language models using iso-depth scaling laws expressed in terms of equivalent unique parameters.
  • From an iso-depth sweep of 116 pretraining runs with recurrence counts r ∈ {1, 2, 4, 8}, spanning ~50× in training compute, the authors fit a joint scaling law and recover a new recurrence-equivalence exponent ϕ = 0.46 (R² = 0.997).
  • The exponent ϕ interpolates between two extremes: ϕ = 1 would mean looping a block r times matches r unique blocks in validation loss, while ϕ = 0 would mean no capacity gain over running a single block once; at ϕ = 0.46, each additional recurrence carries a predictable validation-loss cost at matched training compute.
  • A concrete example shows that at r = 4, a 410M looped model matches a 580M non-looped model's performance but incurs a training compute cost close to that of a 1B non-looped model.
  • Downstream evaluations indicate the performance gap persists on parametric-knowledge tasks, narrows on simple open-book tasks, and cannot be resolved for reasoning tasks within the tested compute budgets.

Abstract

We measure how much one extra recurrence is worth to a looped (depth-recurrent) language model, in equivalent unique parameters. From an iso-depth sweep of 116 pretraining runs across recurrence counts r ∈ {1, 2, 4, 8} spanning ~50× in training compute, we fit a joint scaling law L = E + A·(N_once + r^ϕ·N_rec)^(−α) + B·D^(−β) and recover a new recurrence-equivalence exponent ϕ = 0.46 at R² = 0.997. Intuitively, ϕ tells us whether looping a block r times is equivalent in validation loss to r unique blocks of a non-looped model (full equivalence, ϕ = 1) or to a single block run repeatedly with no capacity gain (ϕ = 0). Our ϕ = 0.46 sits in between, so each additional recurrence predictably increases validation loss at matched training compute. For example, at r = 4 a 410M looped model performs on par with a 580M non-looped model, but pays the training cost of a 1B non-looped one. On a five-axis downstream evaluation, the gap persists on parametric-knowledge tasks and closes on simple open-book tasks, while reasoning tasks are not resolvable at our compute budgets. For any looped LM, our ϕ converts the design choice of r into a predictable validation-loss cost, and future training recipes and architectures can be compared by how much they raise ϕ above 0.46.
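As a sanity check on the r = 4 example, the capacity and compute sides of the law can be sketched in a few lines of Python. The split of the 410M looped model into 220M non-recurrent and 190M recurrent parameters is a hypothetical partition chosen for illustration, since the abstract reports only totals; the function names are likewise illustrative, not from the paper.

```python
PHI = 0.46  # recurrence-equivalence exponent fitted in the paper

def effective_params(n_once: float, n_rec: float, r: int) -> float:
    """Equivalent unique parameters: in the capacity term of the
    scaling law, the looped block's parameters count r**PHI times."""
    return n_once + (r ** PHI) * n_rec

def compute_params(n_once: float, n_rec: float, r: int) -> float:
    """Parameters 'paid for' in training compute: the looped block
    is executed r times per forward pass."""
    return n_once + r * n_rec

# Hypothetical split of a 410M looped model: 220M non-recurrent, 190M recurrent.
n_once, n_rec, r = 220e6, 190e6, 4
print(f"capacity ~ {effective_params(n_once, n_rec, r) / 1e6:.0f}M")  # ~580M
print(f"compute  ~ {compute_params(n_once, n_rec, r) / 1e6:.0f}M")    # ~980M, i.e. near 1B
```

Under this split, the numbers reproduce the paper's example: 4^0.46 ≈ 1.89, so the 190M looped block contributes ~360M of effective capacity (≈580M total) while costing 4 × 190M = 760M parameters' worth of compute (≈1B total).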