AI Navigate

When Does Sparsity Mitigate the Curse of Depth in LLMs

arXiv cs.CL / 3/17/2026


Key Points

  • The paper argues that sparsity helps mitigate the curse of depth in LLMs by regulating variance propagation, leading to better utilization of deeper layers.
  • It distinguishes implicit sparsity from training/data conditions (weight decay-induced weight sparsity, long-context attention sparsity) and explicit sparsity from architectural design (grouped-query attention key/value sharing, Mixture-of-Experts expert-activation sparsity).
  • Through controlled depth-scaling experiments, it shows sparsity reduces output variance and promotes functional differentiation across layers, consistently improving depth utilization.
  • The authors distill their findings into a practical rule-of-thumb for training depth-efficient LLMs and report an accuracy improvement of about 4.6% on downstream tasks.
  • The study provides open-source code implementing the methods at https://github.com/pUmpKin-Co/SparsityAndCoD.

Abstract

Recent work has demonstrated the curse of depth in large language models (LLMs), where later layers contribute less to learning and representation than earlier layers. Such under-utilization is linked to the accumulated growth of variance in Pre-Layer Normalization, which can push deep blocks toward near-identity behavior. In this paper, we demonstrate that sparsity, beyond enabling efficiency, acts as a regulator of variance propagation and thereby improves depth utilization. Our investigation covers two sources of sparsity: (i) implicit sparsity, which emerges from training and data conditions, including weight sparsity induced by weight decay and attention sparsity induced by long-context inputs; and (ii) explicit sparsity, which is enforced by architectural design, including key/value-sharing sparsity in Grouped-Query Attention and expert-activation sparsity in Mixture-of-Experts. Our claim is thoroughly supported by controlled depth-scaling experiments and targeted layer-effectiveness interventions. Across settings, we observe a consistent relationship: sparsity improves layer utilization by reducing output variance and promoting functional differentiation. We eventually distill our findings into a practical rule-of-thumb recipe for training depth-effective LLMs, yielding a notable 4.6% accuracy improvement on downstream tasks. Our results reveal sparsity, arising naturally from standard design choices, as a key yet previously overlooked mechanism for effective depth scaling in LLMs. Code is available at https://github.com/pUmpKin-Co/SparsityAndCoD.
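To build intuition for the variance-accumulation mechanism the abstract describes, here is a toy sketch (not the paper's actual method or code): a Pre-LN residual stream `x <- x + W @ LN(x)` whose variance grows roughly linearly with depth, and how zeroing out a fraction of the weights (a crude stand-in for the paper's sparsity sources) slows that growth. All dimensions, the sparsity level, and the random-mask sparsification are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 256, 32  # hidden width and number of layers (toy values)

def layer_norm(x):
    # simplistic LayerNorm: normalize to zero mean, unit std
    return (x - x.mean()) / (x.std() + 1e-6)

def run_stack(sparsity=0.0):
    """Pre-LN residual stream; returns per-layer variance of x."""
    x = rng.standard_normal(d)
    variances = []
    for _ in range(depth):
        # random linear "block" with roughly unit output variance
        W = rng.standard_normal((d, d)) / np.sqrt(d)
        if sparsity > 0:
            # zero a random fraction of weights (illustrative sparsity)
            W = W * (rng.random((d, d)) >= sparsity)
        x = x + W @ layer_norm(x)
        variances.append(x.var())
    return variances

dense = run_stack(sparsity=0.0)
sparse = run_stack(sparsity=0.9)
# residual-stream variance accumulates with depth; sparsity slows it,
# so each deep block's normalized update stays proportionally larger
print(f"final variance, dense:  {dense[-1]:.1f}")
print(f"final variance, sparse: {sparse[-1]:.1f}")
```

In this sketch the dense stack's variance grows by roughly one unit per layer, while the 90%-sparse stack grows about ten times more slowly, mirroring (in a very loose way) the claim that sparsity regulates variance propagation and keeps deeper layers from collapsing toward identity maps.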