When Does Removing LayerNorm Help? Activation Bounding as a Regime-Dependent Implicit Regularizer

arXiv cs.LG / 4/28/2026

Key Points

  • The paper introduces Dynamic Tanh (DyT), which removes LayerNorm by bounding activations with a learned tanh(αx), and argues that this replacement acts as an implicit regularizer rather than a universally better substitute (a minimal DyT sketch follows this list).
  • Experiments across GPT-2-family models (64M–3.78B parameters) and two token regimes (1M vs. 118M tokens), plus Llama and ViT cross-checks, show regime dependence: DyT improves validation loss by 27.3% at 64M/1M but worsens it by 18.8% at 64M/118M, with the 1M benefit vanishing at larger capacity (+1.7% at 3.78B) and the 118M penalty growing to +27.9%.
  • The authors directly measure activation saturation to support this mechanism, reporting far higher saturation at 1M tokens (49%) than at 118M tokens (23%), and use a 500-step saturation heuristic to predict whether DyT will help or hurt on a 12-cell GPT-2 calibration set (75% raw in-sample accuracy, but only 50% leave-one-scale-out).
  • Several interventions back the “regime-dependent implicit regularization” explanation: HardTanh reproduces the pattern, increasing α at 118M reduces the penalty, and a vanilla model with dropout (p=0.5) matches DyT’s data-rich loss.
  • For Llama, the study localizes DyT’s failure mode (“collapse”) to SwiGLU gating via a component ablation in which saturation separates collapse from normal convergence; all experiments run in compute-limited settings below the Chinchilla-optimal token budget.
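
For concreteness, here is a minimal, hedged sketch of a DyT-style drop-in replacement for LayerNorm, following the commonly used formulation γ·tanh(αx)+β with a learnable scalar α and per-channel affine parameters; the initialization value and module interface below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic-Tanh-style layer: replaces LayerNorm with a bounded activation.

    Output is gamma * tanh(alpha * x) + beta, where alpha is a learnable scalar
    and gamma/beta are per-channel affine parameters (as in LayerNorm's
    elementwise affine). No mean/variance statistics are computed.
    """

    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        # alpha_init = 0.5 is an illustrative default, not the paper's setting.
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Bounding: large |x| saturates toward +/-1 before the affine rescale.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```

Swapping `nn.LayerNorm(dim)` for `DyT(dim)` inside a transformer block is the replacement whose regime-dependent behavior the paper studies; the HardTanh intervention mentioned above would swap `torch.tanh` for a clipped linear map, keeping the bounding while removing the smooth shape.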

Abstract

Dynamic Tanh (DyT) removes LayerNorm by bounding activations with a learned tanh(αx). We show that this bounding is a regime-dependent implicit regularizer, not a uniformly beneficial replacement. Across GPT-2-family models spanning 64M to 3.78B parameters and 1M to 118M tokens, with Llama and ViT cross-checks, DyT improves validation loss by 27.3% at 64M/1M but worsens it by 18.8% at 64M/118M; the 1M benefit vanishes with capacity (+1.7% at 3.78B), while the 118M penalty reaches +27.9%. The mechanism is measurable: 49% of DyT activations saturate at 1M versus 23% at 118M, and a 500-step saturation heuristic classifies DyT's sign with 75% raw in-sample accuracy on the 12-cell GPT-2 calibration set (AUC 0.75; 64% when adding Scale 5 stress cells), correctly labels 3/3 Llama checks, but only reaches 50% raw leave-one-scale-out accuracy. Three interventions support the bounding explanation: HardTanh reproduces the regime pattern, increasing α at 118M monotonically reduces DyT's penalty, and vanilla+dropout(p=0.5) matches DyT's data-rich loss. We also localize Llama-DyT collapse to SwiGLU gating, where saturation separates collapse from convergence in a 3-seed component ablation (r=0.94). Scope: all experiments are compute-limited (T/P < 1.84), below Chinchilla-optimal training.
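
The mechanism claim rests on measuring how much of DyT's output sits in the flat region of tanh. Below is a minimal sketch of such a probe, assuming saturation is operationalized as |tanh(αx)| exceeding a fixed cutoff (0.95 here); the cutoff and the averaging strategy are assumptions, since the abstract does not spell out the exact definition.

```python
import torch

@torch.no_grad()
def saturation_fraction(x: torch.Tensor, alpha: torch.Tensor, cutoff: float = 0.95) -> float:
    """Fraction of DyT outputs lying in the saturated (flat) region of tanh.

    The cutoff on |tanh(alpha * x)| is an assumed operational definition of
    "saturated"; the paper may use a different threshold or statistic.
    """
    y = torch.tanh(alpha * x)
    return (y.abs() > cutoff).float().mean().item()
```

A 500-step heuristic in the abstract's spirit would average this quantity over DyT layers during a short probe run, then threshold it to predict whether a configuration lands in the data-poor regime where bounding helps or the data-rich regime where it hurts.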