When Does Removing LayerNorm Help? Activation Bounding as a Regime-Dependent Implicit Regularizer

arXiv cs.LG / 4/28/2026

Key Points

  • The paper introduces Dynamic Tanh (DyT), which removes LayerNorm by bounding activations with a learned tanh(αx), and argues that this replacement acts as an implicit regularizer rather than a universally better substitute (a minimal DyT sketch follows this list).
  • Experiments across GPT-2-family models (64M–3.78B parameters) and two token regimes (1M vs. 118M tokens), plus Llama and ViT cross-checks, show regime dependence: DyT improves validation loss by 27.3% at 64M/1M but worsens it by 18.8% at 64M/118M, with the 1M benefit vanishing at larger capacity (+1.7% at 3.78B) and the 118M penalty growing to +27.9%.
  • The authors directly measure activation saturation to support this mechanism, reporting far higher saturation at 1M tokens (49%) than at 118M tokens (23%), and use a 500-step saturation heuristic to predict whether DyT will help or hurt on a 12-cell GPT-2 calibration set (75% raw in-sample accuracy, but only 50% leave-one-scale-out).
  • Several interventions back the “regime-dependent implicit regularization” explanation: HardTanh reproduces the pattern, increasing α at 118M reduces the penalty, and a vanilla model with dropout (p=0.5) matches DyT’s data-rich loss.
  • For Llama, the study localizes DyT’s failure mode (“collapse”) to SwiGLU gating via a component ablation in which saturation separates collapse from normal convergence; all experiments run in compute-limited settings below the Chinchilla-optimal token budget.
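
For concreteness, here is a minimal, hedged sketch of a DyT-style drop-in replacement for LayerNorm, following the commonly used formulation γ·tanh(αx)+β with a learnable scalar α and per-channel affine parameters; the initialization value and module interface below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic-Tanh-style layer: replaces LayerNorm with a bounded activation.

    Output is gamma * tanh(alpha * x) + beta, where alpha is a learnable scalar
    and gamma/beta are per-channel affine parameters (as in LayerNorm's
    elementwise affine). No mean/variance statistics are computed.
    """

    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        # alpha_init = 0.5 is an illustrative default, not the paper's setting.
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Bounding: large |x| saturates toward +/-1 before the affine rescale.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```

Swapping `nn.LayerNorm(dim)` for `DyT(dim)` inside a transformer block is the replacement whose regime-dependent behavior the paper studies; the HardTanh intervention mentioned above would swap `torch.tanh` for a clipped linear map, keeping the bounding while removing the smooth shape.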

Abstract

Dynamic Tanh (DyT) removes LayerNorm by bounding activations with a learned tanh(αx). We show that this bounding is a regime-dependent implicit regularizer, not a uniformly beneficial replacement. Across GPT-2-family models spanning 64M to 3.78B parameters and 1M to 118M tokens, with Llama and ViT cross-checks, DyT improves validation loss by 27.3% at 64M/1M but worsens it by 18.8% at 64M/118M; the 1M benefit vanishes with capacity (+1.7% at 3.78B), while the 118M penalty reaches +27.9%. The mechanism is measurable: 49% of DyT activations saturate at 1M versus 23% at 118M, and a 500-step saturation heuristic classifies DyT's sign with 75% raw in-sample accuracy on the 12-cell GPT-2 calibration set (AUC 0.75; 64% when adding Scale 5 stress cells), correctly labels 3/3 Llama checks, but only reaches 50% raw leave-one-scale-out accuracy. Three interventions support the bounding explanation: HardTanh reproduces the regime pattern, increasing α at 118M monotonically reduces DyT's penalty, and vanilla+dropout(p=0.5) matches DyT's data-rich loss. We also localize Llama-DyT collapse to SwiGLU gating, where saturation separates collapse from convergence in a 3-seed component ablation (r=0.94). Scope: all experiments are compute-limited (T/P < 1.84), below Chinchilla-optimal training.
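
The mechanism claim rests on measuring how much of DyT's output sits in the flat region of tanh. Below is a minimal sketch of such a probe, assuming saturation is operationalized as |tanh(αx)| exceeding a fixed cutoff (0.95 here); the cutoff and the averaging strategy are assumptions, since the abstract does not spell out the exact definition.

```python
import torch

@torch.no_grad()
def saturation_fraction(x: torch.Tensor, alpha: torch.Tensor, cutoff: float = 0.95) -> float:
    """Fraction of DyT outputs lying in the saturated (flat) region of tanh.

    The cutoff on |tanh(alpha * x)| is an assumed operational definition of
    "saturated"; the paper may use a different threshold or statistic.
    """
    y = torch.tanh(alpha * x)
    return (y.abs() > cutoff).float().mean().item()
```

A 500-step heuristic in the abstract's spirit would average this quantity over DyT layers during a short probe run, then threshold it to predict whether a configuration lands in the data-poor regime where bounding helps or the data-rich regime where it hurts.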