Attention Residuals
arXiv cs.CL / March 17, 2026
Key Points
- The paper introduces Attention Residuals (AttnRes), which replaces the fixed unit-weight accumulation of layer outputs in standard residual connections with softmax attention over all preceding layer representations, enabling input-dependent, selective aggregation (a minimal sketch follows this list).
- To curb the memory and communication costs this incurs in large-scale training, it proposes Block AttnRes, which partitions the layers into blocks and attends only over block-level representations (see the second sketch below).
- The authors present scaling-law evidence that the benefits of AttnRes persist across model sizes, and report improved gradient and activation distributions when AttnRes is integrated into a 48B-total / 3B-activated Kimi Linear architecture trained on 1.4T tokens.
- For deployment, cache-based pipeline communication and a two-phase computation strategy keep overhead minimal, making AttnRes a practical drop-in replacement for standard residual connections.
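
To make the first key point concrete, here is a minimal PyTorch sketch of how attention over the layer history could replace the unit-weight residual sum. Everything here is an illustrative assumption: the names (`AttnResBlock`, `q_proj`, `k_proj`), the per-token query/key scoring, and the simple FFN sublayer are not taken from the paper, which the summary above describes only at the mechanism level.

```python
import torch
import torch.nn as nn


class AttnResBlock(nn.Module):
    """One block whose residual input is an attention-weighted mix of
    all preceding layer outputs, instead of their fixed unit-weight sum.
    A hypothetical sketch of AttnRes, not the paper's parameterization."""

    def __init__(self, d_model: int):
        super().__init__()
        self.layer = nn.Sequential(  # stand-in for the block's sublayer
            nn.LayerNorm(d_model),
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # Projections that score each stored layer representation.
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.scale = d_model ** -0.5

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (num_prev_layers, batch, seq, d_model), the outputs of
        # the embedding layer and every block so far.
        q = self.q_proj(history[-1])              # (batch, seq, d)
        k = self.k_proj(history)                  # (L, batch, seq, d)
        # One softmax weight per stored layer, per token: the aggregation
        # is input-dependent rather than a fixed unit-weight sum.
        scores = torch.einsum("bsd,lbsd->lbs", q, k) * self.scale
        w = scores.softmax(dim=0)                 # (L, batch, seq)
        residual = torch.einsum("lbs,lbsd->bsd", w, history)
        return residual + self.layer(residual)


# Usage: each block appends its output to the history it attends over,
# so the stored history grows by one tensor per layer.
blocks = nn.ModuleList(AttnResBlock(64) for _ in range(4))
h = torch.randn(2, 8, 64)                         # (batch, seq, d_model)
history = [h]
for blk in blocks:
    history.append(blk(torch.stack(history)))
out = history[-1]
```

The per-layer growth of `history` is exactly the memory and communication cost that motivates the block-level variant in the second key point.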
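Block AttnRes can be sketched the same way: keep one representative per completed block instead of one per layer, so the attended history (and any cross-stage transfer of it under pipeline parallelism) scales with the number of blocks. The structure below is again assumed for illustration; in particular, using a block's final output as its representative and keeping plain residuals inside each block are guesses, not the paper's specification.

```python
import torch
import torch.nn as nn


class BlockAttnRes(nn.Module):
    """Hypothetical Block AttnRes stack: ordinary residuals inside each
    block, attention over one stored representation per finished block
    at block boundaries. All structural choices here are assumptions."""

    def __init__(self, d_model: int, num_blocks: int, layers_per_block: int):
        super().__init__()
        make_layer = lambda: nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.blocks = nn.ModuleList(
            nn.ModuleList(make_layer() for _ in range(layers_per_block))
            for _ in range(num_blocks)
        )
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        block_history = [x]  # one entry per completed block, not per layer
        for block in self.blocks:
            hist = torch.stack(block_history)     # (n_done+1, batch, seq, d)
            q = self.q_proj(block_history[-1])    # (batch, seq, d)
            k = self.k_proj(hist)                 # (n, batch, seq, d)
            scores = torch.einsum("bsd,nbsd->nbs", q, k) * self.scale
            w = scores.softmax(dim=0)             # per-token block weights
            h = torch.einsum("nbs,nbsd->bsd", w, hist)
            for layer in block:                   # plain residuals inside
                h = h + layer(h)
            block_history.append(h)
        return block_history[-1]


# Usage: 2 blocks of 3 layers keep a history of at most 3 tensors,
# versus 7 for per-layer AttnRes at the same depth.
model = BlockAttnRes(d_model=64, num_blocks=2, layers_per_block=3)
out = model(torch.randn(2, 8, 64))
```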