Attention to Mamba: A Recipe for Cross-Architecture Distillation
arXiv cs.LG / 4/17/2026
💬 Opinion / Models & Research
Key Points
- The paper addresses how to distill a pretrained Transformer into a Mamba-like state space model (SSM) while avoiding the performance drop seen in naive cross-architecture distillation methods.
- It proposes a principled two-stage distillation recipe: first distilling the pretrained Transformer into a linearized-attention variant (via a kernel-trick-style adaptation of its attention), then distilling that intermediate model into an adapted Mamba model with no attention blocks (see the sketch after this list).
- With this approach, the distilled Mamba model preserves much of the original teacher's quality, achieving a downstream perplexity of 14.11 versus the teacher's 13.86 on a Pythia-1B baseline.
- The authors validate the recipe through extensive ablations at ~1B scale (with 10B tokens), exploring different sequence-mixer architectures, scaling behavior across model sizes, total distillation token budgets, and sensitivity to how tokens are allocated between the two stages.
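Below is a minimal sketch of how such a two-stage recipe could look in practice. It assumes PyTorch, a layer-wise MSE matching loss on sequence-mixer outputs, an elu-based kernel feature map for the linearized attention, and a simplified diagonal state-space recurrence as a stand-in for a full Mamba block; the paper's exact losses, feature map, and Mamba architecture are not reproduced here, and all module and function names below are hypothetical.

```python
# Hypothetical sketch of two-stage mixer distillation; not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearAttention(nn.Module):
    """Stage-1 student: softmax attention replaced by a kernel feature map
    phi(x) = elu(x) + 1, turning attention into a causal linear recurrence."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, L, D)
        q = F.elu(self.q(x)) + 1              # non-negative query features
        k = F.elu(self.k(x)) + 1              # non-negative key features
        v = self.v(x)
        # Prefix sums give the causal (autoregressive) linearized form.
        kv = torch.einsum("bld,ble->blde", k, v).cumsum(dim=1)  # sum_{j<=t} phi(k_j) v_j^T
        z = k.cumsum(dim=1)                                     # sum_{j<=t} phi(k_j)
        num = torch.einsum("bld,blde->ble", q, kv)
        den = torch.einsum("bld,bld->bl", q, z).unsqueeze(-1).clamp(min=1e-6)
        return num / den


class DiagonalSSM(nn.Module):
    """Stage-2 student: a simplified diagonal state-space recurrence used as a
    stand-in for a Mamba block (no selective scan or gating)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.log_decay = nn.Parameter(torch.zeros(d_model))  # per-channel state decay
        self.in_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, L, D)
        u = self.in_proj(x)
        a = torch.sigmoid(self.log_decay)     # decay in (0, 1)
        h = torch.zeros_like(u[:, 0])
        outs = []
        for t in range(u.shape[1]):           # explicit scan: clear, not fast
            h = a * h + u[:, t]
            outs.append(h)
        return self.out_proj(torch.stack(outs, dim=1))


def distill_step(teacher_mixer, student_mixer, hidden_states, optimizer):
    """One layer-wise distillation step: match the student mixer's output to the
    frozen teacher mixer's output on the same hidden states (MSE objective)."""
    with torch.no_grad():
        target = teacher_mixer(hidden_states)
    loss = F.mse_loss(student_mixer(hidden_states), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    B, L, D = 2, 16, 64
    hidden = torch.randn(B, L, D)            # stand-in for a layer's hidden states
    stage1_student = LinearAttention(D)      # stage 1 target architecture
    stage2_student = DiagonalSSM(D)          # stage 2 target architecture
    opt = torch.optim.AdamW(stage2_student.parameters(), lr=1e-3)
    print(distill_step(stage1_student, stage2_student, hidden, opt))
```

In this framing, stage 1 would call distill_step with the frozen Transformer's attention block as the teacher and LinearAttention as the student, and stage 2 would reuse the same step with the trained LinearAttention as teacher and the SSM as student; splitting the token budget between those two calls is exactly the allocation question the ablations probe.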