Revisiting Auxiliary Losses for Conditional Depth Routing: An Empirical Study
arXiv cs.LG · April 21, 2026
Key Points
- The paper investigates how auxiliary losses affect training stability for conditional depth token routing, where a gating module sends some tokens through a cheap FFN and the rest through a full FFN in a controlled set of layers.
- It compares two gate designs (G1: MLP utility scoring vs. G3: JEPA-guided action-conditional prediction) on a 157.5M decoder-only model with controller-only training and a 50% full-path budget, finding G3 improves early-to-mid optimization in 3/3 runs under a standard util/rank auxiliary-loss recipe.
- Ablations show that removing the util/rank auxiliary supervision improves both best and average LM loss, as well as the speed of reaching loss thresholds, for both gates, and G3's earlier advantage over G1 disappears.
- The authors attribute the negative effect of the utility/rank losses to an off-policy oracle-labeling assumption: utility labels are computed as if all subsequent layers run the full path, which mismatches gated execution, where only a fraction of tokens is routed through the full path.
- Dropping util/rank also reduces the training-compute proxy from about 1.53× to about 1.07× of the full-only baseline, suggesting a practical efficiency benefit within the tested regime.
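To make the routing setup in the bullets concrete, here is a minimal sketch of budgeted token routing: a gate scores each token, the top fraction (the 50% full-path budget) takes a stand-in "full" path, and the rest take a cheap path. The gate weights, the tanh stand-in for the full FFN, the identity cheap path, and all shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def route_tokens(x, w_gate, budget=0.5):
    """Route tokens by gate score under a fixed full-path budget.

    x:      (tokens, dim) token activations (hypothetical shapes)
    w_gate: (dim,) linear gate weights standing in for an MLP scorer
    """
    scores = x @ w_gate                    # per-token utility scores
    k = int(np.ceil(budget * x.shape[0]))  # full-path budget (top-k)
    full_idx = np.argsort(-scores)[:k]     # highest-scoring tokens
    mask = np.zeros(x.shape[0], dtype=bool)
    mask[full_idx] = True

    out = x.copy()
    # Full path: a nonlinearity stands in for the full FFN.
    out[mask] = np.tanh(x[mask])
    # Cheap path: identity here; a real cheap path would be a small FFN.
    return out, mask

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))
w = rng.standard_normal(4)
out, mask = route_tokens(x, w, budget=0.5)
print(mask.sum())  # 4 of 8 tokens take the full path
```

The util/rank auxiliary losses the paper ablates would supervise `scores` against oracle utility labels; the reported mismatch is that those labels assume every later layer runs the full path, while at execution time only the masked tokens do.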