Hierarchical vs. Flat Iteration in Shared-Weight Transformers

arXiv cs.CL · April 17, 2026

Key Points

  • The paper studies whether a hierarchically structured, shared-weight recurrent scheme in Transformers (HRM-LM) can achieve the same representational quality as stacking independent Transformer layers.
  • HRM-LM replaces N Transformer layers with a two-speed recurrent design, using a Fast module at every step for local refinement and a Slow module every T steps for global compression.
  • The method is unrolled for M = N×T recurrent steps using shared parameters across the unrolled computation.
  • A parameter-matched Universal Transformer ablation (UniTF, 1.2B), run across five independent seeds, shows a sharp and robust gap in performance and representational quality between HRM-LM and the layer-stacking baseline.

Abstract

We present an empirical study of whether hierarchically structured, shared-weight recurrence can match the representational quality of independent-layer stacking in a Transformer-based language model. HRM-LM replaces N independent Transformer layers with a two-speed recurrent pair: a Fast module operating at every step for local refinement, and a Slow module operating every T steps for global compression. This recurrent hierarchy is unrolled for M = N × T steps with shared parameters. The central and most robust finding, supported by a parameter-matched Universal Transformer ablation (UniTF, 1.2B) across five independent runs, is a sharp empirical gap between the two approaches.
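The two-speed schedule can be sketched in a few lines of Python. This is an illustrative control-flow skeleton only, not the paper's implementation: the function names `fast_step` and `slow_step` are hypothetical stand-ins for the shared-weight Fast and Slow modules, and the state is an opaque value rather than a real hidden tensor.

```python
# Illustrative sketch of the HRM-LM unroll schedule (assumed, not from the paper):
# the Fast module fires at every step, the Slow module every T steps,
# for a total of M = N * T shared-weight recurrent steps.

def hrm_unroll(state, fast_step, slow_step, N, T):
    """Run M = N*T recurrent steps with shared parameters.

    fast_step/slow_step are the (shared) Fast and Slow modules;
    here they are arbitrary callables on the hidden state.
    """
    M = N * T
    for step in range(1, M + 1):
        state = fast_step(state)       # local refinement, every step
        if step % T == 0:
            state = slow_step(state)   # global compression, every T steps
    return state


def make_counting_step(name, calls):
    """Toy module that just records how often it fires."""
    def step(state):
        calls[name] += 1
        return state
    return step


# Toy usage: with N=3 "layers" and period T=4, Fast fires M=12 times
# and Slow fires N=3 times (at steps 4, 8, 12).
calls = {"fast": 0, "slow": 0}
hrm_unroll(0.0,
           make_counting_step("fast", calls),
           make_counting_step("slow", calls),
           N=3, T=4)
print(calls)  # {'fast': 12, 'slow': 3}
```

The parameter-matched UniTF baseline corresponds, in this sketch, to dropping the Slow branch and repeating a single shared block for all M steps.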