When Perplexity Lies: Generation-Focused Distillation of Hybrid Sequence Models

arXiv cs.CL / 3/30/2026

Key Points

  • The paper argues that distilling Transformer models for efficient inference can significantly degrade generation quality if the student architecture and the distillation process are not co-designed for autoregressive generation rather than multiple-choice scoring.
  • It demonstrates that log-likelihood/perplexity-based evaluation can mask large real-world gaps: a distilled 7B model matches its teacher to within 0.2 pp under log-likelihood scoring yet falls 20.8 pp behind when required to generate answers autoregressively.
  • The authors introduce the Hybrid-KDA architecture and a multi-stage distillation pipeline called GenDistill, using generation-based evaluation to guide design decisions throughout training.
  • Systematic ablations on Qwen3-0.6B across six design axes (training objective, loss masking, training duration, dataset selection, parameter freezing, and architecture choice) show that dataset selection, completion-only loss masking, and freezing attention layers during post-training have the largest impact on generation quality.
  • The best Hybrid-KDA student retains 86–90% of teacher accuracy on knowledge benchmarks while reducing KV-cache memory by up to 75% and improving time-to-first-token by 2–4× at 128K-token contexts.
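
As a rough illustration of the completion-only loss masking mentioned in the key points, the sketch below averages a token-level loss over completion tokens only and ignores the prompt. It is a minimal sketch under stated assumptions, not the paper's implementation: plain negative log-likelihood stands in for whatever distillation objective GenDistill uses, and the helper names, the token IDs, and the log-probabilities are hypothetical. The `-100` sentinel is borrowed from the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
# Sentinel for masked positions (assumption: borrowed from PyTorch's
# CrossEntropyLoss default ignore_index; any reserved value would do).
IGNORE_INDEX = -100


def completion_only_labels(token_ids, prompt_len):
    """Return labels where the first `prompt_len` positions are masked out."""
    if not 0 <= prompt_len < len(token_ids):
        raise ValueError("prompt_len must leave at least one completion token")
    return [IGNORE_INDEX] * prompt_len + list(token_ids[prompt_len:])


def masked_nll(token_logprobs, labels):
    """Mean negative log-likelihood over positions whose label is not masked."""
    kept = [-lp for lp, y in zip(token_logprobs, labels) if y != IGNORE_INDEX]
    return sum(kept) / len(kept)


# Hypothetical example: 3 prompt tokens followed by 2 completion tokens.
token_ids = [11, 12, 13, 21, 22]
token_logprobs = [-2.0, -1.0, -3.0, -0.5, -1.5]  # made-up per-token log-probs

labels = completion_only_labels(token_ids, prompt_len=3)
completion_loss = masked_nll(token_logprobs, labels)   # 1.0 (completion only)
full_loss = masked_nll(token_logprobs, token_ids)      # 1.6 (all tokens)
```

With these made-up numbers, the prompt tokens account for most of the full-sequence loss, so masking them shifts the training signal toward the tokens the model must actually generate.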

Abstract

Converting a pretrained Transformer into a more efficient hybrid model through distillation offers a promising approach to reducing inference costs. However, achieving high-quality generation in distilled models requires careful joint design of both the student architecture and the distillation process. Many prior distillation works evaluate downstream multiple-choice benchmarks by ranking candidate answers with log-likelihood rather than requiring autoregressive generation, which can obscure important differences in model quality. For example, we show that a 7B-parameter distilled model that nearly matches its teacher to within 0.2 pp under log-likelihood scoring actually falls behind by 20.8 pp when the model must generate answers autoregressively. We propose a Hybrid Kimi Delta Attention (Hybrid-KDA) architecture paired with GenDistill, a multi-stage distillation pipeline, and use generation-based evaluation throughout to guide design decisions. Applying this approach to Qwen3-0.6B, we systematically ablate six design axes: training objective, loss masking, training duration, dataset selection, parameter freezing, and architecture choice. We find that log-likelihood-based evaluation consistently underestimates the gap between teacher and student, and can in some cases reverse the ranking of design choices, meaning that conclusions drawn from perplexity-only evaluation may be misleading. Among the factors we study, dataset selection, completion-only masking, and freezing attention layers during post-training have the largest impact on generation quality. Our best Hybrid-KDA model retains 86–90% of teacher accuracy on knowledge benchmarks while reducing KV cache memory by up to 75% and improving time-to-first-token by 2–4× at 128K-token contexts.
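
The difference between the two evaluation modes described in the abstract can be seen in a toy example. The sketch below is illustrative only: the "model" is a hand-written next-token table with invented probabilities, and the function names are hypothetical, not taken from the paper. Log-likelihood scoring only compares the listed candidates, so it picks the right letter; greedy generation follows the single most probable token at each step and never emits a letter at all.

```python
import math

# Hypothetical toy "model": context (tuple of tokens) -> next-token probabilities.
# All numbers are invented for illustration.
TOY_MODEL = {
    (): {"The": 0.6, "B": 0.3, "A": 0.1},
    ("The",): {"answer": 1.0},
    ("The", "answer"): {"<eos>": 1.0},
    ("A",): {"<eos>": 1.0},
    ("B",): {"<eos>": 1.0},
}


def sequence_logprob(tokens):
    """Sum of log-probabilities of `tokens` under the toy model."""
    context, total = (), 0.0
    for tok in tokens:
        p = TOY_MODEL.get(context, {}).get(tok, 0.0)
        if p == 0.0:
            return -math.inf
        total += math.log(p)
        context += (tok,)
    return total


def pick_by_loglikelihood(candidates):
    """Multiple-choice scoring: rank only the given candidate answers."""
    return max(candidates, key=lambda c: sequence_logprob((c, "<eos>")))


def greedy_generate(max_len=5):
    """Autoregressive evaluation: emit the most probable token at each step."""
    context, output = (), []
    for _ in range(max_len):
        dist = TOY_MODEL.get(context)
        if not dist:
            break
        tok = max(dist, key=dist.get)
        if tok == "<eos>":
            break
        output.append(tok)
        context += (tok,)
    return output


gold = "B"
scored = pick_by_loglikelihood(["A", "B"])   # "B": counted as correct
generated = greedy_generate()                # ["The", "answer"]: no letter emitted
```

Under log-likelihood scoring this toy model gets the item right, while an exact-match check on the generated text would mark it wrong. This is the kind of gap that candidate ranking can hide.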