Dynamic sparsity in tree-structured feed-forward layers at scale
arXiv cs.AI / 2026/4/13
Key points
- The paper proposes tree-structured sparse feed-forward (MLP) layers as drop-in replacements for transformer MLP blocks, using hard hierarchical routing for conditional computation without a separate router network.
- Experiments show that, for autoregressive language modeling and question answering (including zero- and few-shot), models activate under 5% of MLP units per token while matching dense baselines under controlled training and fine-tuning.
- The approach is demonstrated to scale beyond 1B parameters, indicating the method works in large-model regimes rather than only in toy settings.
- The authors analyze training dynamics and find an emergent auto-pruning effect where hard routing plus asymmetric nonlinearities gradually deactivates unused paths, partially turning dynamic routing into static sparsity.
- Simple architectural tweaks can modulate this pruning behavior, recovering more balanced trees without auxiliary losses, making the sparsification controllable.
- Overall, the work positions tree-structured conditional sparsity as a scalable mechanism to reduce transformer compute while preserving performance.
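To make the routing mechanism concrete, here is a minimal NumPy sketch of the idea described above: a binary tree of depth 4 whose internal nodes each hold a routing hyperplane, with a small MLP at each leaf. A token descends the tree by hard sign decisions (no separate router network) and only the chosen leaf's units run, so one of 16 leaves is active per token. All parameter shapes and names (`route_w`, `W1`, `W2`) are hypothetical, chosen for illustration; the paper's actual parameterization may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, depth = 16, 4        # depth-4 binary tree -> 16 leaf experts
n_leaves = 2 ** depth

# One routing hyperplane per internal node (hypothetical parameterization).
route_w = rng.standard_normal((2 ** depth - 1, d_model))
# One small expert MLP per leaf (hidden width 8, also illustrative).
W1 = rng.standard_normal((n_leaves, d_model, 8))
W2 = rng.standard_normal((n_leaves, 8, d_model))

def tree_ffn(x):
    """Hard hierarchical routing: descend the tree by the sign of a dot
    product at each node, then apply only the selected leaf MLP."""
    node = 0
    for _ in range(depth):
        go_right = x @ route_w[node] > 0          # hard (non-differentiable) decision
        node = 2 * node + (2 if go_right else 1)  # heap-style child index
    leaf = node - (n_leaves - 1)                  # map node id to leaf index
    h = np.maximum(x @ W1[leaf], 0.0)             # asymmetric nonlinearity (ReLU)
    return h @ W2[leaf], leaf

x = rng.standard_normal(d_model)
y, leaf = tree_ffn(x)
```

With 16 leaves, each token touches 1/16 of the expert units (6.25%; a deeper tree would reach the sub-5% regime the paper reports). The "auto-pruning" observation then corresponds to some leaves ceasing to receive tokens during training, so dynamic routing partially collapses into static sparsity.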
