In standard residual connections, each layer simply adds its output to the sum of all previous layer outputs with equal weight and no selectivity. Attention Residuals replaces this with a softmax attention mechanism: each layer gets a single learned query vector that attends over all previous layer outputs, producing input-dependent weights that let the layer selectively retrieve what it actually needs. In scaling law experiments, Block AttnRes achieves the same loss as a baseline trained with 1.25x more compute. Integrated into a 48B-parameter (3B activated) Kimi Linear model trained on 1.4T tokens, it improves across all evaluated benchmarks: GPQA-Diamond +7.5, Math +3.6, and HumanEval +3.1. The overhead is minimal: less than 4% additional training cost under pipeline parallelism, and under 2% added inference latency. Karpathy also joined the discussion with the comment "Attention is all you need!" Source of the visualization image: https://x.com/eliebakouch/status/2033488233854620007?s=20
Residual connections haven't changed for 10 years and Kimi just replaced them with attention
Reddit r/LocalLLaMA / 3/16/2026
📰 News · Ideas & Deep Analysis · Models & Research
Key Points
- The article explains how Attention Residuals replaces traditional residual connections by using a per-layer learned query to attend over previous layer outputs, yielding input-dependent routing.
- In scaling experiments, Block AttnRes matches the loss of a baseline trained with 1.25x more compute, and when integrated into a 48B-parameter Kimi Linear model trained on 1.4T tokens, it achieves notable gains on GPQA-Diamond (+7.5), Math (+3.6), and HumanEval (+3.1).
- The approach adds modest overhead, with under 4% extra training cost under pipeline parallelism and under 2% additional inference latency.
- Karpathy joined the discussion with the comment "Attention is all you need!", and the article includes a visualization image sourced from a linked X post.
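The mechanism described above can be sketched in a few lines. This is a minimal NumPy illustration of the idea, not Kimi's actual implementation: the function name, shapes, and scaling are my assumptions. A single learned query vector scores every previous layer's output; because the keys are the (input-dependent) layer outputs themselves, the softmax weights vary with the input even though the query is fixed per layer.

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_residual(prev_outputs, query):
    """Sketch of an attention residual (hypothetical shapes/names).

    prev_outputs: list of L arrays, each (seq, d) -- outputs of earlier layers.
    query: (d,) -- this layer's single learned query vector.
    Returns a (seq, d) mix of previous outputs, replacing their plain sum.
    """
    stack = np.stack(prev_outputs)                     # (L, seq, d)
    scores = stack @ query / np.sqrt(query.shape[0])   # (L, seq) per-position scores
    weights = softmax(scores, axis=0)                  # normalize over layers, not tokens
    return np.einsum("ls,lsd->sd", weights, stack)     # weighted sum of layer outputs
```

A standard residual stream would instead return `stack.sum(axis=0)`; here each token position gets its own convex combination of earlier layers' outputs.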
Related Articles

Astral to Join OpenAI
Dev.to

PearlOS. We gave swarm intelligence a local desktop environment and code control to self-evolve. Has been pretty incredible to see so far. Open source and free if you want your own.
Reddit r/LocalLLaMA

Why Data is Important for LLM
Dev.to

The Inference Market Is Consolidating. Agent Payments Are Still Nobody's Problem.
Dev.to

YouTube's Deepfake Shield for Politicians Changes Evidence Forever
Dev.to