Kimi (Moonshot AI) dropped a paper on something called "attention residuals" that replaces the standard residual connection transformers inherited from ResNet back in 2015.
The TLDR: normal residual connections just sum everything from all previous layers together. Layer 40 gets the accumulated output of layers 1-39 piled into one stream, so the deeper you go, the more diluted early-layer information gets. Kimi calls this the "dilution problem."
Their fix is to let each layer selectively attend to the outputs of all previous layers instead of just taking the sum. Basically each layer gets to pick, via learned attention weights, which earlier layers matter most for the current input.
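As I understand it, the mechanism would look roughly like this. A minimal pure-Python sketch, not the paper's actual implementation; `scores` is a made-up stand-in for the learned query-key logits (one per previous layer):

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of logits
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_residual(layer_out, prev_outputs, scores):
    """Combine the current layer's output with a learned mix of earlier
    layer outputs, instead of a plain running sum.
    prev_outputs: list of vectors, one per previous layer.
    scores: hypothetical learned logits, one per previous layer."""
    weights = softmax(scores)
    d = len(layer_out)
    mixed = [0.0] * d
    for w, prev in zip(weights, prev_outputs):
        for i in range(d):
            mixed[i] += w * prev[i]
    # residual = current output + attention-weighted history
    return [layer_out[i] + mixed[i] for i in range(d)]
```

With uniform scores this degenerates to averaging the history; the point is that in the real architecture the scores are input-dependent, so each layer can up-weight whichever earlier layer actually matters.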
Results on their benchmarks:
- 3-7.5 point improvements on grad-level exams, math reasoning, code gen, and long-context tasks
- ~1.25x compute savings with the block variant
- training overhead under 4%, inference latency increase under 2%
- scales well; bigger models benefit more
They also did a "block attention residual" variant where layers are grouped into blocks: within a block it's the normal residual sum, between blocks it's attention-based. This keeps most of the benefit while being much cheaper to run.
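A rough sketch of how the block variant could work, again hypothetical rather than the paper's code: plain residuals inside each block, an attention-weighted re-mix of block outputs at block boundaries (`block_scores` stands in for the learned logits):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def plain_residual(x, layer):
    # standard transformer residual: layer output plus its input
    return [a + b for a, b in zip(layer(x), x)]

def block_attention_residual(x, blocks, block_scores):
    """blocks: list of blocks, each a list of layer functions.
    block_scores: hypothetical learned per-block attention logits."""
    block_outputs = []
    for block in blocks:
        for layer in block:
            x = plain_residual(x, layer)   # normal residual inside the block
        block_outputs.append(x)
        # at the block boundary: attention-weighted mix over all block outputs so far
        weights = softmax(block_scores[:len(block_outputs)])
        x = [sum(w * out[i] for w, out in zip(weights, block_outputs))
             for i in range(len(x))]
    return x
```

This is why it's cheaper: the attention mixing happens once per block instead of once per layer, so the extra memory traffic scales with the number of blocks, not the number of layers.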
What's interesting is DeepSeek also tried to fix residual connections recently with their mHC approach, but went a completely different direction: DeepSeek adds parallel streams, Kimi adds selective attention. Someone compared them, and Kimi's approach apparently needs 1/6 the memory bandwidth of DeepSeek's mHC while getting similar or better results.
The practical implication: Kimi's version is supposedly a drop-in replacement. You swap the residual module, keep everything else the same, retrain, and get the improvements. DeepSeek's mHC requires restructuring the whole model architecture.
Karpathy commented on this, saying maybe attention can be applied in more places in the transformer than we thought, which is an interesting direction.
For local model people this matters because if open-weight models adopt it, we could see meaningful quality improvements without needing bigger models: same parameter count, better information flow, better results.
The paper has code on GitHub (MoonshotAI/Attention-Residuals). Would be cool to see someone test it on a 7B or 13B and check whether the improvements hold at smaller scales.
One thing I'm wondering about is the quantization interaction: if the inter-layer attention weights are sensitive to precision, quantization might hurt more than usual with this architecture.
Been testing various models through verdent lately and the quality gap between architectures is getting more noticeable than the gap between parameter counts. feels like architecture innovation matters more than just scaling up at this point.
Paper link: github.com/MoonshotAI/Attention-Residuals


