Mixture-of-Depths Attention - arXiv

Reddit r/LocalLLaMA / 4/20/2026

💬 Opinion · Developer Stack & Infrastructure · Models & Research

Key Points

  • The paper addresses a common depth-scaling problem in LLMs where deeper layers can experience signal degradation, diluting informative features from shallower layers.
  • It proposes Mixture-of-Depths Attention (MoDA), letting each attention head mix key/value pairs from the current layer with key/value pairs from preceding layers.
  • The authors introduce a hardware-efficient MoDA algorithm that handles non-contiguous memory access and reaches 97.3% of FlashAttention-2’s efficiency at sequence length 64K.
  • Experiments on 1.5B-parameter models show MoDA reduces average perplexity by 0.2 across 10 validation benchmarks and boosts average downstream performance by 2.11% across 10 tasks, with only a 3.7% FLOPs overhead.
  • The results also indicate that MoDA combined with post-norm works better than MoDA with pre-norm, suggesting MoDA as a promising building block for depth scaling.

Scaling depth is a key driver for large language models (LLMs). Yet, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism that allows each attention head to attend to sequence KV pairs at the current layer and depth KV pairs from preceding layers. We further describe a hardware-efficient algorithm for MoDA that resolves non-contiguous memory-access patterns, achieving 97.3% of FlashAttention-2's efficiency at a sequence length of 64K. Experiments on 1.5B-parameter models demonstrate that MoDA consistently outperforms strong baselines. Notably, it reduces average perplexity by 0.2 across 10 validation benchmarks and increases average performance by 2.11% on 10 downstream tasks, with a negligible 3.7% FLOPs overhead. We also find that combining MoDA with post-norm yields better performance than using it with pre-norm. These results suggest that MoDA is a promising primitive for depth scaling.
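To make the mechanism concrete, here is a minimal single-head sketch of the idea described in the abstract: the head attends jointly to the current layer's KV pairs and to KV pairs cached from preceding layers. This is my own simplified reading, not the paper's implementation — the head count, masking, and the hardware-efficient kernel handling non-contiguous memory are all omitted, and the function and variable names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moda_attention(q, k_cur, v_cur, depth_kv):
    """Simplified single-head mixture-of-depths attention (illustrative).

    q, k_cur, v_cur : (T, d) queries and sequence KV pairs at the current layer
    depth_kv        : list of (k, v) pairs of shape (T, d), cached from
                      preceding layers ("depth KV pairs")
    """
    # Pool the current layer's KV with the cached depth KV pairs so one
    # softmax mixes sequence attention and depth attention.
    k = np.concatenate([k_cur] + [k for k, _ in depth_kv], axis=0)
    v = np.concatenate([v_cur] + [v for _, v in depth_kv], axis=0)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T, T * (1 + num_prev_layers))
    # NOTE: causal masking over the sequence axis is omitted for brevity.
    return softmax(scores, axis=-1) @ v       # (T, d)

# Toy usage: 4 tokens, head dim 8, KV cached from two earlier layers.
rng = np.random.default_rng(0)
T, d = 4, 8
q, k_cur, v_cur = (rng.standard_normal((T, d)) for _ in range(3))
depth_kv = [(rng.standard_normal((T, d)), rng.standard_normal((T, d)))
            for _ in range(2)]
out = moda_attention(q, k_cur, v_cur, depth_kv)
print(out.shape)  # (4, 8): same shape as standard attention output
```

The point of the sketch is that MoDA changes only what the head attends *to*, so its output shape matches standard attention; the extra cost comes from the enlarged key/value set, which the paper reports as a 3.7% FLOPs overhead.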

Paper : https://arxiv.org/abs/2603.15619

Code : https://github.com/hustvl/MoDA

Blog : https://lh-zhu.github.io/The-Second-Half-of-Model-Architecture/

Via Source Tweet #JustSharing

submitted by /u/pmttyji