Dynamic sparsity in tree-structured feed-forward layers at scale
arXiv cs.AI / 2026/4/13
Key points
- The paper proposes tree-structured sparse feed-forward (MLP) layers as drop-in replacements for transformer MLP blocks, using hard hierarchical routing for conditional computation without a separate router network.
- Experiments show that, for autoregressive language modeling and question answering (including zero- and few-shot), models activate under 5% of MLP units per token while matching dense baselines under controlled training and fine-tuning.
- The approach is demonstrated to scale beyond 1B parameters, indicating the method works in large-model regimes rather than only in toy settings.
- The authors analyze training dynamics and find an emergent auto-pruning effect where hard routing plus asymmetric nonlinearities gradually deactivates unused paths, partially turning dynamic routing into static sparsity.
- Simple architectural tweaks can modulate this pruning behavior, recovering more balanced trees without auxiliary losses, making the sparsification controllable.
- Overall, the work positions tree-structured conditional sparsity as a scalable mechanism to reduce transformer compute while preserving performance.
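To make the routing mechanism concrete, here is a minimal NumPy sketch of the idea described above: a binary tree of depth 4 whose internal nodes each hold a routing hyperplane, with a small MLP at each leaf. A token descends the tree by hard sign decisions (no separate router network) and only the chosen leaf's units run, so one of 16 leaves is active per token. All parameter shapes and names (`route_w`, `W1`, `W2`) are hypothetical, chosen for illustration; the paper's actual parameterization may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, depth = 16, 4        # depth-4 binary tree -> 16 leaf experts
n_leaves = 2 ** depth

# One routing hyperplane per internal node (hypothetical parameterization).
route_w = rng.standard_normal((2 ** depth - 1, d_model))
# One small expert MLP per leaf (hidden width 8, also illustrative).
W1 = rng.standard_normal((n_leaves, d_model, 8))
W2 = rng.standard_normal((n_leaves, 8, d_model))

def tree_ffn(x):
    """Hard hierarchical routing: descend the tree by the sign of a dot
    product at each node, then apply only the selected leaf MLP."""
    node = 0
    for _ in range(depth):
        go_right = x @ route_w[node] > 0          # hard (non-differentiable) decision
        node = 2 * node + (2 if go_right else 1)  # heap-style child index
    leaf = node - (n_leaves - 1)                  # map node id to leaf index
    h = np.maximum(x @ W1[leaf], 0.0)             # asymmetric nonlinearity (ReLU)
    return h @ W2[leaf], leaf

x = rng.standard_normal(d_model)
y, leaf = tree_ffn(x)
```

With 16 leaves, each token touches 1/16 of the expert units (6.25%; a deeper tree would reach the sub-5% regime the paper reports). The "auto-pruning" observation then corresponds to some leaves ceasing to receive tokens during training, so dynamic routing partially collapses into static sparsity.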
