Kimi (Moonshot AI) dropped a paper on something called "attention residuals" that replaces the standard residual connection transformers inherited from ResNet back in 2015.
The TLDR: normal residual connections just sum everything from all previous layers together. Layer 40 gets the accumulated output of layers 1-39 piled into one stream, so the deeper you go, the more diluted early-layer information gets. Kimi calls this the "dilution problem."
Their fix is to let each layer selectively attend to the outputs of all previous layers instead of just taking the sum. Basically each layer gets to pick, via learned attention weights, which earlier layers matter most for the current input.
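As I understand it, the mechanism would look roughly like this. A minimal pure-Python sketch, not the paper's actual implementation; `scores` is a made-up stand-in for the learned query-key logits (one per previous layer):

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of logits
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_residual(layer_out, prev_outputs, scores):
    """Combine the current layer's output with a learned mix of earlier
    layer outputs, instead of a plain running sum.
    prev_outputs: list of vectors, one per previous layer.
    scores: hypothetical learned logits, one per previous layer."""
    weights = softmax(scores)
    d = len(layer_out)
    mixed = [0.0] * d
    for w, prev in zip(weights, prev_outputs):
        for i in range(d):
            mixed[i] += w * prev[i]
    # residual = current output + attention-weighted history
    return [layer_out[i] + mixed[i] for i in range(d)]
```

With uniform scores this degenerates to averaging the history; the point is that in the real architecture the scores are input-dependent, so each layer can up-weight whichever earlier layer actually matters.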
Results on their benchmarks:
- 3-7.5 point improvements on grad-level exams, math reasoning, code gen, and long-context tasks
- ~1.25x compute savings with the block variant
- training overhead under 4%, inference latency increase under 2%
- scales well; bigger models benefit more
They also did a "block attention residual" variant where layers are grouped into blocks: within a block it's the normal residual sum, between blocks it's attention-based. This keeps most of the benefit while being much cheaper to run.
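A rough sketch of how the block variant could work, again hypothetical rather than the paper's code: plain residuals inside each block, an attention-weighted re-mix of block outputs at block boundaries (`block_scores` stands in for the learned logits):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def plain_residual(x, layer):
    # standard transformer residual: layer output plus its input
    return [a + b for a, b in zip(layer(x), x)]

def block_attention_residual(x, blocks, block_scores):
    """blocks: list of blocks, each a list of layer functions.
    block_scores: hypothetical learned per-block attention logits."""
    block_outputs = []
    for block in blocks:
        for layer in block:
            x = plain_residual(x, layer)   # normal residual inside the block
        block_outputs.append(x)
        # at the block boundary: attention-weighted mix over all block outputs so far
        weights = softmax(block_scores[:len(block_outputs)])
        x = [sum(w * out[i] for w, out in zip(weights, block_outputs))
             for i in range(len(x))]
    return x
```

This is why it's cheaper: the attention mixing happens once per block instead of once per layer, so the extra memory traffic scales with the number of blocks, not the number of layers.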
What's interesting is DeepSeek also tried to fix residual connections recently with their mHC approach, but went a completely different direction: DeepSeek adds parallel streams, Kimi adds selective attention. Someone compared them, and Kimi's approach apparently needs 1/6 the memory bandwidth of DeepSeek's mHC while getting similar or better results.
The practical implication: Kimi's version is supposedly a drop-in replacement. You swap the residual module, keep everything else the same, retrain, and get the improvements. DeepSeek's mHC requires restructuring the whole model architecture.
Karpathy commented on this, saying maybe attention can be applied in more places in the transformer than we thought, which is an interesting direction.
For local model people this matters because if open-weight models adopt it, we could see meaningful quality improvements without needing bigger models: same parameter count, better information flow, better results.
The paper has code on GitHub (MoonshotAI/Attention-Residuals). Would be cool to see someone test it on a 7B or 13B and check whether the improvements hold at smaller scales.
One thing I'm wondering about is the quantization interaction: if the inter-layer attention weights are sensitive to precision, quantization might hurt more than usual with this architecture.
Been testing various models through verdent lately and the quality gap between architectures is getting more noticeable than the gap between parameter counts. feels like architecture innovation matters more than just scaling up at this point.
Paper link: github.com/MoonshotAI/Attention-Residuals


