In standard residual connections, each layer simply adds its output to the sum of all previous layer outputs with equal weight and no selectivity. Attention Residuals replaces this with a softmax attention mechanism: each layer gets a single learned query vector that attends over all previous layer outputs, producing input-dependent weights that let the layer selectively retrieve what it actually needs. In scaling law experiments, Block AttnRes achieves the same loss as a baseline trained with 1.25x more compute. Integrated into a 48B-parameter (3B activated) Kimi Linear model trained on 1.4T tokens, it improves across all evaluated benchmarks: GPQA-Diamond +7.5, Math +3.6, and HumanEval +3.1. The overhead is minimal: less than 4% additional training cost under pipeline parallelism, and under 2% added inference latency. Karpathy also joined the discussion with the comment "Attention is all you need!" Source of the visualization image: https://x.com/eliebakouch/status/2033488233854620007?s=20
Residual connections haven't changed for 10 years and Kimi just replaced them with attention
Reddit r/LocalLLaMA / 3/16/2026
📰 News · Ideas & Deep Analysis · Models & Research
Key Points
- The article explains how Attention Residuals replaces traditional residual connections by using a per-layer learned query to attend over previous layer outputs, yielding input-dependent routing.
- In scaling experiments, Block AttnRes matches the loss of a baseline trained with 1.25x more compute, and when integrated into a 48B-parameter Kimi Linear model trained on 1.4T tokens, it achieves notable gains on GPQA-Diamond (+7.5), Math (+3.6), and HumanEval (+3.1).
- The approach adds modest overhead, with under 4% extra training cost under pipeline parallelism and under 2% additional inference latency.
- Karpathy joined the discussion with the comment "Attention is all you need!", and the article includes a visualization image sourced from a linked X post.
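The mechanism described above can be sketched in a few lines. This is a minimal NumPy illustration of the idea, not Kimi's actual implementation: the function name, shapes, and scaling are my assumptions. A single learned query vector scores every previous layer's output; because the keys are the (input-dependent) layer outputs themselves, the softmax weights vary with the input even though the query is fixed per layer.

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_residual(prev_outputs, query):
    """Sketch of an attention residual (hypothetical shapes/names).

    prev_outputs: list of L arrays, each (seq, d) -- outputs of earlier layers.
    query: (d,) -- this layer's single learned query vector.
    Returns a (seq, d) mix of previous outputs, replacing their plain sum.
    """
    stack = np.stack(prev_outputs)                     # (L, seq, d)
    scores = stack @ query / np.sqrt(query.shape[0])   # (L, seq) per-position scores
    weights = softmax(scores, axis=0)                  # normalize over layers, not tokens
    return np.einsum("ls,lsd->sd", weights, stack)     # weighted sum of layer outputs
```

A standard residual stream would instead return `stack.sum(axis=0)`; here each token position gets its own convex combination of earlier layers' outputs.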
Related Articles

Astral to Join OpenAI
Dev.to

PearlOS. We gave swarm intelligence a local desktop environment and code control to self-evolve. Has been pretty incredible to see so far. Open source and free if you want your own.
Reddit r/LocalLLaMA

Why Data is Important for LLM
Dev.to

The Inference Market Is Consolidating. Agent Payments Are Still Nobody's Problem.
Dev.to

YouTube's Deepfake Shield for Politicians Changes Evidence Forever
Dev.to