Subcritical Signal Propagation at Initialization in Normalization-Free Transformers

arXiv cs.LG · April 15, 2026


Key Points

  • The paper analyzes signal and gradient propagation at transformer initialization using the averaged partial Jacobian norm (APJN) as a measure of gradient amplification across layers.
  • It extends APJN theory to bidirectional attention and permutation-symmetric token setups by deriving layer-to-layer recurrence relations for activation statistics and APJNs.
  • The results show that attention changes the asymptotic APJN behavior at large depth and that the framework matches APJN measurements reported in deep vision transformers.
  • It finds a criticality analogy with residual networks: pre-LayerNorm transformers show power-law APJN growth (critical), while replacing LayerNorm with tanh-like nonlinearities yields stretched-exponential APJN growth (subcritical).
  • The theory explains why Dynamic Tanh (DyT) and Dynamic erf (Derf) transformers can be more sensitive to initialization/optimization and therefore need careful tuning for stable training.
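The APJN ideas above can be illustrated numerically. The sketch below estimates an APJN-style quantity for a deep stack of randomly initialized tanh layers, tracking how the squared Frobenius norm of the partial Jacobian evolves with depth. The width, depth, weight scale `sigma`, and norm convention are illustrative assumptions, not the paper's exact setup (which treats full transformer blocks with attention).

```python
# Sketch: track an APJN-style gradient-norm statistic across depth for a
# randomly initialized stack of tanh layers (illustrative assumptions:
# width n, depth, and weight scale sigma are arbitrary choices).
import numpy as np

rng = np.random.default_rng(0)
n, depth, sigma = 64, 20, 1.5  # width, number of layers, weight scale

x = rng.standard_normal(n)
J = np.eye(n)   # running Jacobian d h^l / d h^0
apjn = []
for l in range(depth):
    W = sigma / np.sqrt(n) * rng.standard_normal((n, n))
    x = np.tanh(W @ x)
    # chain rule: layer Jacobian is diag(tanh'(z)) @ W, with tanh'(z) = 1 - x**2
    J = (1.0 - x**2)[:, None] * W @ J
    # APJN-style quantity: squared Frobenius norm per unit width
    apjn.append(np.linalg.norm(J, "fro") ** 2 / n)

print(apjn[0], apjn[-1])
```

Plotting `apjn` against depth on log-log versus log-linear axes is one way to distinguish power-law (critical) from exponential or stretched-exponential (off-critical) growth of the kind the paper analyzes.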

Abstract

We study signal propagation at initialization in transformers through the averaged partial Jacobian norm (APJN), a measure of gradient amplification across layers. We extend APJN analysis to transformers with bidirectional attention and permutation-symmetric input token configurations by deriving recurrence relations for activation statistics and APJNs across layers. Our theory predicts how attention modifies the asymptotic behavior of the APJN at large depth and matches APJNs measured in deep vision transformers. The criticality picture known from residual networks carries over to transformers: the pre-LayerNorm architecture exhibits power-law APJN growth, whereas transformers with LayerNorm replaced by elementwise tanh-like nonlinearities have stretched-exponential APJN growth, indicating that the latter are subcritical. Applied to Dynamic Tanh (DyT) and Dynamic erf (Derf) transformers, the theory explains why these architectures can be more sensitive to initialization and optimization choices and require careful tuning for stable training.
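For readers unfamiliar with the normalization-free architectures named above, the following is a minimal sketch of a Dynamic Tanh (DyT) layer, the elementwise LayerNorm replacement of the form gamma * tanh(alpha * x) + beta with a learnable input scale alpha and affine parameters gamma, beta. The default parameter values here are illustrative assumptions, not the paper's initialization.

```python
# Minimal sketch of a Dynamic Tanh (DyT) layer: an elementwise, bounded
# substitute for LayerNorm. Default alpha/gamma/beta values are
# illustrative assumptions.
import numpy as np

def dyt(x, alpha=0.5, gamma=None, beta=None):
    """Elementwise Dynamic Tanh: gamma * tanh(alpha * x) + beta."""
    gamma = np.ones_like(x) if gamma is None else gamma
    beta = np.zeros_like(x) if beta is None else beta
    return gamma * np.tanh(alpha * x) + beta

x = np.linspace(-4.0, 4.0, 5)
print(dyt(x))  # squashed into (-1, 1) at default gamma=1, beta=0
```

Unlike LayerNorm, DyT does not compute per-token statistics, which is why its signal-propagation behavior at initialization (and hence its criticality class) can differ from the pre-LayerNorm baseline.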