Super Apriel: One Checkpoint, Many Speeds

arXiv cs.LG / 4/23/2026


Key Points

  • The paper introduces Super Apriel, a 15B-parameter supernet where each decoder layer can switch among four attention/mixer options—Full Attention, Sliding Window Attention, Kimi Delta Attention, and Gated DeltaNet—at serving time using a single shared checkpoint.
  • Because mixer placements can be changed without reloading weights, the same model checkpoint supports multiple speed “presets,” and it also enables speculative decoding without needing a separate draft model.
  • Performance results show the all-FA configuration matches the Apriel 1.6 teacher on reported benchmarks, while hybrid presets trade off quality retention (96% to 77%) for decode throughput gains (2.9× to 10.7×) that become even more pronounced at longer context lengths.
  • The authors use a placement-quality surrogate model to make the large configuration space tractable and study how quickly optimal placements can be identified during training: placement rankings stabilize early at 0.5B scale, but the most efficient configurations are less stable at 15B, cautioning against extrapolating from smaller models.
  • Super Apriel includes released artifacts such as supernet weights, Fast-LLM training code, vLLM serving code, and a placement optimization toolkit, supporting practical deployment and further experimentation.
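The core serving idea above — a "placement" is just a per-layer choice among the four trained mixers, and switching presets re-routes computation without reloading the checkpoint — can be sketched as follows. This is a minimal illustration, not the released vLLM/Fast-LLM code; the preset names and the example hybrid pattern are hypothetical.

```python
from enum import Enum

class Mixer(Enum):
    """The four trained mixer options available in every decoder layer."""
    FA = "full_attention"
    SWA = "sliding_window_attention"
    KDA = "kimi_delta_attention"
    GDN = "gated_deltanet"

NUM_LAYERS = 48  # Super Apriel has 48 decoder layers

# Hypothetical presets: a placement assigns one mixer per layer.
# "quality" routes every layer through Full Attention; "balanced"
# is an illustrative hybrid pattern, not a placement from the paper.
PRESETS: dict[str, list[Mixer]] = {
    "quality": [Mixer.FA] * NUM_LAYERS,
    "balanced": [Mixer.FA if i % 4 == 0 else Mixer.GDN
                 for i in range(NUM_LAYERS)],
}

def select_placement(preset_name: str) -> list[Mixer]:
    """Return the per-layer mixer assignment for a named preset.

    Because all four mixer branches are trained in the shared
    checkpoint, changing the placement only changes which branch
    each layer routes through between requests; no weights are
    reloaded."""
    return PRESETS[preset_name]
```

In a serving loop, each incoming request could carry a preset name, and the model would dispatch each layer's forward pass to the mixer named in the selected placement.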

Abstract

We release Super Apriel, a 15B-parameter supernet in which every decoder layer provides four trained mixer choices: Full Attention (FA), Sliding Window Attention (SWA), Kimi Delta Attention (KDA), and Gated DeltaNet (GDN). A placement selects one mixer per layer; placements can be switched between requests at serving time without reloading weights, enabling multiple speed presets from a single checkpoint. The shared checkpoint also enables speculative decoding without a separate draft model. The all-FA preset matches the Apriel 1.6 teacher on all reported benchmarks; recommended hybrid presets span 2.9× to 10.7× decode throughput at 96% to 77% quality retention, with throughput advantages that compound at longer context lengths. With four mixer types across 48 layers, the configuration space is vast. A surrogate that predicts placement quality from the per-layer mixer assignment makes the speed-quality landscape tractable and identifies the best tradeoffs at each speed level. We investigate whether the best configurations at each speed level can be identified early in training or only after convergence. Rankings stabilize quickly at 0.5B scale, but the most efficient configurations exhibit higher instability at 15B, cautioning against extrapolation from smaller models. Super Apriel is trained by stochastic distillation from a frozen Apriel 1.6 teacher, followed by supervised fine-tuning. We release the supernet weights, Fast-LLM training code, vLLM serving code, and a placement optimization toolkit.
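The surrogate described in the abstract scores a placement from its per-layer mixer assignment, which makes searching the 4^48 configuration space feasible. A toy version of that idea, assuming a simple additive surrogate with one learned contribution per (layer, mixer) pair and a per-mixer cost model — all of which are illustrative assumptions, not the paper's actual surrogate — might look like this:

```python
MIXERS = ["FA", "SWA", "KDA", "GDN"]
NUM_LAYERS = 48

def surrogate_score(placement: list[str],
                    weights: dict[tuple[int, str], float]) -> float:
    """Predicted quality of a placement: sum of learned per-(layer, mixer)
    contributions. A real surrogate would be fit on measured qualities of
    sampled placements; here the weights are supplied by the caller."""
    return sum(weights[(i, m)] for i, m in enumerate(placement))

def best_under_budget(weights: dict[tuple[int, str], float],
                      cost: dict[str, float],
                      max_cost: float,
                      candidates: list[list[str]]) -> list[str]:
    """Among candidate placements whose total mixer cost fits the budget,
    return the one the surrogate predicts to be highest quality."""
    feasible = [p for p in candidates
                if sum(cost[m] for m in p) <= max_cost]
    return max(feasible, key=lambda p: surrogate_score(p, weights))
```

Sweeping `max_cost` over a range of compute budgets would trace out a predicted speed-quality frontier, from which presets like the paper's recommended hybrids could be picked.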