PolyGLU: State-Conditional Activation Routing in Transformer Feed-Forward Networks
arXiv cs.LG / March 17, 2026
📰 News · Models & Research
Key Points
- PolyGLU is a drop-in replacement for SwiGLU that lets each FFN neuron dynamically route among four activation functions via a differentiable mechanism combining learned static preferences with input-conditioned gating, trained end-to-end with Gumbel-Softmax.
- The authors train PolychromaticLM, a 597M-parameter transformer, on ~10B tokens using a single NVIDIA A100, with only ~0.23% parameter overhead (about 1.4M parameters).
- The routing exhibits emergent near-deterministic activation selections and depth-dependent specialization (early layers prefer GELU, deeper layers prefer Tanh), while three layers retain elevated routing entropy; the mechanism remains stable under supervised fine-tuning, with entropy staying near ln(4) through 13,067 SFT steps.
- On standard benchmarks, PolychromaticLM achieves 62-89% of Qwen3-0.6B-Base performance despite training on 3,600x fewer tokens, and all code, weights, and training infrastructure are released under Apache 2.0.
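The routing mechanism described above can be sketched in a few lines. This is a minimal, hedged illustration, not the paper's implementation: the four-activation set, the `route_neuron` name, and the scalar stand-in for the input-conditioned gating network are all assumptions; the paper's actual training path uses Gumbel-Softmax sampling, which is only noted in comments here.

```python
import math

# Hypothetical four-activation pool; the paper confirms GELU and Tanh are
# among the choices, the other two members here are assumptions.
ACTIVATIONS = [
    lambda x: x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0))),  # GELU
    math.tanh,                                                  # Tanh
    lambda x: max(0.0, x),                                      # ReLU (assumed)
    lambda x: x / (1.0 + math.exp(-x)),                         # SiLU (assumed)
]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_neuron(x, static_logits, gate_weights):
    """Mix the four activations for a single neuron's pre-activation x.

    static_logits: learned per-neuron static preference over the 4 activations.
    gate_weights:  toy input-conditioned gate (scalar-linear stand-in for the
                   paper's gating mechanism; an assumption for illustration).

    Returns (mixed_output, routing_probs). At inference this is a soft
    mixture; during training the paper draws from these logits with
    Gumbel-Softmax so routing can anneal toward near-deterministic choices.
    """
    logits = [s + w * x for s, w in zip(static_logits, gate_weights)]
    probs = softmax(logits)
    y = sum(p * act(x) for p, act in zip(probs, ACTIVATIONS))
    return y, probs
```

Note that a uniform routing distribution over four activations has entropy ln(4) ≈ 1.386, which is the reference value the entropy figures in the Key Points are measured against.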
Related Articles

Report: Observations of "Self-Referential Recursion" and "Stateful Emulation" in LLMs
note

Dialogues with Master Zhuge Liang Kongming (a ChatGPT roleplay), Part 45: "Galactic Civilization and the Dark Matter Engine"
note

GPT-5.4 mini/nano Arrive! Small, High-Performance Models That Are 2x Faster, Available Even on the Free Plan
note

Why a Perfect-Memory AI Agent Without Persona Drift is Architecturally Impossible
Dev.to
OCP: Orthogonal Constrained Projection for Sparse Scaling in Industrial Commodity Recommendation
arXiv cs.LG