Supernodes and Halos: Loss-Critical Hubs in LLM Feed-Forward Layers
arXiv cs.LG · April 28, 2026
📰 News · Developer Stack & Infrastructure · Models & Research
Key Points
- The paper analyzes transformer feed-forward networks (FFNs) and shows that loss sensitivity is concentrated in a small fraction of channels, using a Fisher-style loss proxy based on activation-gradient second moments.
- In Llama-3.1-8B, the top 1% of channels per layer accounts for a median of 58.7% of the loss-proxy (LP) mass, and the authors term these channels “supernodes.”
- “Supernodes” only weakly overlap with activation-defined outliers and are not explained solely by activation power or weight norms, indicating a distinct loss-critical structure.
- Beyond the supernode core, the authors observe a weaker “halo” where some non-supernode channels share write support and exhibit redundancy with the protected core.
- One-shot structured FFN pruning experiments show that protecting supernodes (SCAR-Prot) preserves performance far better than baselines (perplexity 54.8 vs. 989.2 at 50% sparsity), with similar LP-concentration patterns across multiple LLM families and consistent scaling behavior during pretraining (see the sketch after this list).
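To make the key points above concrete, here is a minimal PyTorch sketch of how a per-channel, Fisher-style loss proxy and a supernode-protected pruning mask could be computed. The function names (`channel_loss_proxy`, `supernode_mask`, `protected_prune_mask`) and the exact proxy form (second moment of activation times activation-gradient over a small calibration set) are illustrative assumptions for this summary, not the paper's released code.

```python
# Hedged sketch: per-channel loss-proxy (LP) scores for one FFN layer and a
# supernode-protected, one-shot structured pruning mask.
# Names and the exact proxy definition are assumptions; the paper may differ.
import torch

def channel_loss_proxy(acts: torch.Tensor, grads: torch.Tensor) -> torch.Tensor:
    """Fisher-style proxy: second moment of (activation * activation-gradient)
    per hidden channel, averaged over calibration tokens.
    acts, grads: [num_tokens, hidden_dim] captured from the FFN's hidden
    activations and d(loss)/d(activations)."""
    return (acts * grads).pow(2).mean(dim=0)   # [hidden_dim]

def supernode_mask(lp: torch.Tensor, top_frac: float = 0.01) -> torch.Tensor:
    """Mark the top `top_frac` of channels by LP score as supernodes."""
    k = max(1, int(round(top_frac * lp.numel())))
    idx = torch.topk(lp, k).indices
    mask = torch.zeros_like(lp, dtype=torch.bool)
    mask[idx] = True
    return mask

def protected_prune_mask(lp: torch.Tensor, sparsity: float,
                         protected: torch.Tensor) -> torch.Tensor:
    """One-shot structured pruning: drop the lowest-LP channels until the target
    sparsity is reached, but never drop protected (supernode) channels."""
    n = lp.numel()
    n_prune = int(round(sparsity * n))
    order = torch.argsort(lp)                  # ascending LP: cheapest to remove first
    keep = torch.ones(n, dtype=torch.bool)
    pruned = 0
    for j in order.tolist():
        if pruned >= n_prune:
            break
        if protected[j]:
            continue                           # skip supernodes
        keep[j] = False
        pruned += 1
    return keep                                # True = keep channel

# Example: scores from a calibration pass, then a 50%-sparsity protected mask.
if __name__ == "__main__":
    torch.manual_seed(0)
    acts = torch.randn(1024, 14336)            # 14336 = Llama-3.1-8B FFN hidden dim
    grads = torch.randn(1024, 14336)            # placeholder for captured gradients
    lp = channel_loss_proxy(acts, grads)
    supernodes = supernode_mask(lp, top_frac=0.01)
    keep = protected_prune_mask(lp, sparsity=0.5, protected=supernodes)
    print(f"supernodes: {supernodes.sum().item()}, kept channels: {keep.sum().item()}")
```

The protection step simply skips supernode channels while filling the pruning budget, which mirrors the spirit of the SCAR-Prot comparison described above; the paper's actual selection criterion and budget handling may differ.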