AI Navigate

Residual connections haven't changed for 10 years and Kimi just replaced them with attention

Reddit r/LocalLLaMA / 3/16/2026

📰 News · Ideas & Deep Analysis · Models & Research

Key Points

  • The article explains how Attention Residuals replaces traditional residual connections by using a per-layer learned query to attend over previous layer outputs, yielding input-dependent routing.
  • In scaling experiments, Block AttnRes matches the loss of a baseline trained with 1.25x more compute, and when integrated into a 48B-parameter Kimi Linear model trained on 1.4T tokens, it achieves notable gains on GPQA-Diamond (+7.5), Math (+3.6), and HumanEval (+3.1).
  • The approach adds modest overhead, with under 4% extra training cost under pipeline parallelism and under 2% additional inference latency.
  • Karpathy participated in the discussion 'Attention is all you need!', and the article includes a visualization image sourced from a linked X post.

In standard residual connections, each layer simply adds its output to the running sum of all previous layers' outputs, weighting every contribution equally; there is no selectivity at all. Attention Residuals replaces this with a softmax attention mechanism: each layer gets a single learned query vector that attends over all previous layer outputs, producing input-dependent weights that let the layer selectively retrieve what it actually needs.
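To make the contrast concrete, here is a minimal numpy sketch of the idea as described above. It is an assumption-laden simplification, not Kimi's implementation: it uses raw layer outputs as keys and values, a single query vector per layer, and omits any key/value projections or normalization the real method may use.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_residual(layer_outputs, query):
    """Input-dependent combination of previous layer outputs.

    layer_outputs: list of (d,) vectors, the outputs of layers 0..L-1
    query: (d,) learned per-layer query vector (hypothetical simplification:
           raw hidden states serve as both keys and values)
    """
    H = np.stack(layer_outputs)                # (L, d)
    scores = H @ query / np.sqrt(H.shape[1])   # one score per previous layer
    weights = softmax(scores)                  # input-dependent mixing weights
    return weights @ H                         # weighted sum over layer outputs

def plain_residual(layer_outputs):
    """Standard residual stream: equal-weight sum, no selectivity."""
    return np.sum(np.stack(layer_outputs), axis=0)
```

Because the weights come from a softmax over scores that depend on the hidden states themselves, the mix changes per token, whereas `plain_residual` always applies the same equal weighting.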

On scaling law experiments, Block AttnRes achieves the same loss as a baseline trained with 1.25x more compute. Integrated into a 48B-parameter (3B activated) Kimi Linear model trained on 1.4T tokens, it improves across all evaluated benchmarks: GPQA-Diamond +7.5, Math +3.6, and HumanEval +3.1. The overhead is minimal: less than 4% additional training cost under pipeline parallelism, and under 2% inference latency increase.

Karpathy also participated in the discussion "Attention is all you need!"

Source of the visualization image: https://x.com/eliebakouch/status/2033488233854620007?s=20

submitted by /u/Helpful-Guava7452