In standard residual connections, each layer simply adds its output to the sum of all previous layers' outputs with equal weight, with no selectivity at all. Attention Residuals replaces this with a softmax attention mechanism: each layer gets a single learned query vector that attends over all previous layer outputs, producing input-dependent weights that let the layer selectively retrieve what it actually needs. In scaling law experiments, Block AttnRes matches the loss of a baseline trained with 1.25x more compute. Integrated into a 48B-parameter (3B activated) Kimi Linear model trained on 1.4T tokens, it improves across all evaluated benchmarks: GPQA-Diamond +7.5, Math +3.6, and HumanEval +3.1. The overhead is minimal: less than 4% additional training cost under pipeline parallelism and under 2% added inference latency. Karpathy also joined the discussion: "Attention is all you need!" Source of the visualization image: https://x.com/eliebakouch/status/2033488233854620007?s=20
Residual connections haven't changed for 10 years and Kimi just replaced them with attention
Reddit r/LocalLLaMA / 3/16/2026
📰 News · Ideas & Deep Analysis · Models & Research
Key Points
- The article explains how Attention Residuals replaces traditional residual connections by using a per-layer learned query to attend over previous layer outputs, yielding input-dependent routing.
- In scaling experiments, Block AttnRes matches the loss of a baseline trained with 1.25x more compute, and when integrated into a 48B-parameter Kimi Linear model trained on 1.4T tokens, it achieves notable gains on GPQA-Diamond (+7.5), Math (+3.6), and HumanEval (+3.1).
- The approach adds modest overhead, with under 4% extra training cost under pipeline parallelism and under 2% additional inference latency.
- Karpathy participated in the discussion, quipping "Attention is all you need!"; the article includes a visualization image sourced from a linked X post.
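The mechanism described above can be sketched in a few lines. The following is a minimal, illustrative NumPy version for a single token position, not Kimi's actual implementation: function and variable names are assumptions, and details such as the score scaling are hedged guesses at a standard attention formulation.

```python
import numpy as np

def attention_residual(history: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Illustrative attention-based residual connection.

    history: (L, d) stack of all previous layer outputs for one token
    query:   (d,)   learned per-layer query vector

    Instead of summing the L previous outputs with equal weight (a plain
    residual stream), a softmax over query-key scores yields
    input-dependent weights for mixing them.
    """
    d = history.shape[-1]
    scores = history @ query / np.sqrt(d)   # (L,) one score per previous layer
    weights = np.exp(scores - scores.max()) # numerically stable softmax
    weights /= weights.sum()
    return weights @ history                # (d,) mixed residual stream
```

With a zero query the weights are uniform and this reduces to the ordinary averaged residual sum; a query aligned with one layer's output routes almost all weight to that layer.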