Attention to Mamba: A Recipe for Cross-Architecture Distillation

arXiv cs.LG / 4/17/2026

💬 Opinion · Models & Research

Key Points

  • The paper addresses how to distill a pretrained Transformer into a Mamba-like state space model (SSM) while avoiding the performance drop seen in naive cross-architecture distillation methods.
  • It proposes a principled two-stage distillation recipe: first distilling from a Transformer into a linearized attention variant (via a kernel-trick-style adaptation), then distilling that linearized form into an adapted Mamba model with no attention blocks.
  • With this approach, the distilled Mamba model preserves much of the original teacher quality, achieving a downstream perplexity of 14.11 versus the teacher’s 13.86 on a Pythia-1B baseline.
  • The authors validate the recipe through extensive ablations at ~1B scale (with 10B tokens), exploring different sequence-mixer architectures, scaling behavior across model sizes, total distillation token budgets, and sensitivity to how tokens are allocated between the two stages.
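The first stage hinges on the kernel trick for linearizing attention: replace the softmax similarity with a feature map φ, so that softmax(QKᵀ)V becomes φ(Q)(φ(K)ᵀV), which can be reassociated to run in time linear in sequence length. A minimal numpy sketch of that idea (the feature map elu(x)+1 is a common choice from the linear-attention literature, not necessarily the paper's exact adaptation):

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard attention: softmax(Q K^T / sqrt(d)) V, O(n^2 d).
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def phi(x):
    # Positive feature map elu(x) + 1, a common linear-attention kernel.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # Kernel trick: approximate softmax(Q K^T) by phi(Q) phi(K)^T, then
    # reassociate as phi(Q) (phi(K)^T V) -- O(n d^2) instead of O(n^2 d).
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                # (d, d_v), shared across all queries
    Z = Qf @ Kf.sum(axis=0)      # per-query normalizer, shape (n,)
    return (Qf @ KV) / Z[:, None]

rng = np.random.default_rng(0)
n, d = 5, 4
Q, K, V = rng.normal(size=(3, n, d))
out = linear_attention(Q, K, V)
print(out.shape)  # (5, 4)
```

Because the `(φ(K)ᵀV)` term can be accumulated recurrently token by token, this form is a natural stepping stone toward an SSM such as Mamba, which motivates using it as the intermediate architecture.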

Abstract

State Space Models (SSMs) such as Mamba have become a popular alternative to Transformer models, due to their reduced memory consumption and higher throughput at generation compared to their Attention-based counterparts. On the other hand, the community has built up a considerable body of knowledge on how to train Transformers, and many pretrained Transformer models are readily available. To facilitate the adoption of SSMs while leveraging existing pretrained Transformers, we aim to identify an effective recipe to distill an Attention-based model into a Mamba-like architecture. In prior work on cross-architecture distillation, however, it has been shown that a naïve distillation procedure from Transformers to Mamba fails to preserve the original teacher performance, a limitation often overcome with hybrid solutions combining Attention and SSM blocks. The key argument of our work is that, by equipping Mamba with a principled initialization, we can recover an overall better recipe for cross-architectural distillation. To this end, we propose a principled two-stage approach: first, we distill knowledge from a traditional Transformer into a linearized version of Attention, using an adaptation of the kernel trick. Then, we distill the linearized version into an adapted Mamba model that does not use any Attention block. Overall, the distilled Mamba model is able to preserve the original Pythia-1B Transformer performance in downstream tasks, maintaining a perplexity of 14.11, close to the teacher's 13.86. To show the efficacy of our recipe, we conduct thorough ablations at 1B scale with 10B tokens varying the sequence-mixer architecture, a scaling analysis across model sizes and total distillation tokens, and a sensitivity analysis on the allocation of tokens between the two stages.
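Both stages of such a recipe minimize a distillation objective between teacher and student outputs. A minimal sketch, assuming the standard temperature-softened KL loss on logits (the paper's exact loss, temperature, and stage split are not specified here; the 50/50 budget split is purely illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(teacher_logits, student_logits, T=2.0):
    # Forward KL between temperature-softened teacher and student
    # distributions -- a standard distillation objective (assumption:
    # the paper's actual loss may differ).
    p = softmax(teacher_logits / T)
    q = softmax(student_logits / T)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * T**2)

# Hypothetical split of the 10B-token budget between the two stages;
# the sensitivity analysis varies this fraction.
total_tokens = 10_000_000_000
stage1_frac = 0.5
stage1_tokens = int(total_tokens * stage1_frac)   # Transformer -> linear attn
stage2_tokens = total_tokens - stage1_tokens      # linear attn -> Mamba

rng = np.random.default_rng(0)
t = rng.normal(size=(8, 32))            # teacher logits (batch, vocab)
s = t + 0.1 * rng.normal(size=(8, 32))  # student slightly off the teacher
print(distill_loss(t, s) >= 0.0)
```

The loss is zero exactly when the student matches the teacher's soft distribution, which is why a principled initialization of the Mamba student (stage 1's linearized attention) gives stage 2 a much shorter distance to close.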