Can an MLP Absorb Its Own Skip Connection?

arXiv cs.LG / 4/28/2026

📰 News · Models & Research

Key Points

  • The paper analyzes whether a skip connection around a single-hidden-layer MLP can be mathematically “absorbed” into a residual-free MLP of the same width, focusing on function-class equivalence.
  • It proves that when the skip branch is an invertible linear map, the question reduces to the identity-skip case, including settings such as Hyper-Connections.
  • For homogeneous activations with degree k ≠ 1 (e.g., ReLU² and ReGLU), absorption is unconditionally impossible by a degree argument (a small numerical illustration follows this list); the same impossibility holds, via linearization, for gated activations whose gate is differentiable at the origin and vanishes there, such as SwiGLU and GeGLU.
  • The impossibility results also extend to arbitrary depth: compositions of L residual blocks using these activations cannot be replicated by any composition of L residual-free blocks of the same width.
  • For ungated ReLU and GELU, absorption can occur only under a specific non-generic algebraic condition on the weights, so skip-connected and residual-free MLPs of the same width represent generically disjoint function classes; whether this disjointness persists in deep stacks remains open.
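
To make the degree argument concrete, here is a minimal numerical sketch (ours, not taken from the paper): with σ(z) = ReLU(z)², which is positively homogeneous of degree 2, a residual-free block scales exactly as λ² under x ↦ λx, while a block with an identity skip mixes degree-1 and degree-2 terms and so cannot behave that way. The widths and weight matrices below are arbitrary placeholders.

```python
# Sketch of the degree argument for homogeneous activations (degree k = 2 here).
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 16                          # input/output width d, hidden width m
W_up = rng.standard_normal((m, d))
W_down = rng.standard_normal((d, m))

def relu2(z):
    return np.maximum(z, 0.0) ** 2    # homogeneous of degree 2 for lam > 0

def residual_free(x):
    return W_down @ relu2(W_up @ x)

def residual(x):
    return x + W_down @ relu2(W_up @ x)

x = rng.standard_normal(d)
lam = 3.0

# The residual-free block is exactly 2-homogeneous ...
print(np.allclose(residual_free(lam * x), lam**2 * residual_free(x)))  # True
# ... but the skip-connected block is not, since the skip term scales as lam.
print(np.allclose(residual(lam * x), lam**2 * residual(x)))            # False
```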

Abstract

We study when a skip connection around a single-hidden-layer MLP can be absorbed into a residual-free MLP of the same width. We first show that for any architecture whose skip branch is an invertible linear map (including Hyper-Connections and their manifold-constrained variants), the problem reduces to the identity skip case. For homogeneous activations of degree k \neq 1, such as ReLU^2 and ReGLU, absorption is unconditionally impossible by a degree argument. For gated activations whose gate is differentiable at the origin with g(0) = 0, including SwiGLU and GeGLU, a linearization argument gives the same conclusion. These impossibility results extend to arbitrary depth: a composition of L residual blocks using such activations cannot be replicated by any composition of L residual-free blocks of the same width. For ungated ReLU and GELU, the situation is richer. For generic weight matrices, absorption holds at the single-block level if and only if there exists an index set S of size at least d such that W_{\mathrm{down}}[:,S]\,W_{\mathrm{up}}[S,:] = -I_d. This condition is non-generic (it fails with probability one under continuous weight distributions), so skip-connected and residual-free MLPs of the same width represent generically disjoint function classes. Whether this disjointness persists for deep compositions of ReLU or GELU blocks remains open.
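
As a rough illustration of the non-genericity claim, the toy check below (our own sketch, not code from the paper) samples random weights and searches index sets S for the condition W_{\mathrm{down}}[:,S] W_{\mathrm{up}}[S,:] = -I_d. For simplicity it only searches sets of size exactly d, although the abstract's condition allows |S| ≥ d. Randomly drawn continuous weights essentially never satisfy it, while weights engineered so that a column block of W_down is the negated inverse of the matching row block of W_up do.

```python
# Hedged check that the absorption condition is non-generic for random weights.
import itertools
import numpy as np

rng = np.random.default_rng(1)
d, m = 3, 8
W_up = rng.standard_normal((m, d))
W_down = rng.standard_normal((d, m))

def satisfies_condition(W_down, W_up, tol=1e-8):
    """Search index sets S of size exactly d for W_down[:, S] @ W_up[S, :] == -I_d."""
    m, d = W_up.shape
    for S in itertools.combinations(range(m), d):
        S = list(S)
        if np.allclose(W_down[:, S] @ W_up[S, :], -np.eye(d), atol=tol):
            return True
    return False

# Fails for generic (continuously distributed) random weights.
print(satisfies_condition(W_down, W_up))      # False

# Weights engineered so the first d columns of W_down are the negated inverse
# of the first d rows of W_up satisfy the algebraic condition by construction.
W_down_eng = W_down.copy()
W_down_eng[:, :d] = -np.linalg.inv(W_up[:d, :])
print(satisfies_condition(W_down_eng, W_up))  # True
```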