AI Navigate

Ultra-Fast Language Generation via Discrete Diffusion Divergence Instruct

arXiv cs.CL / 3/13/2026

Key Points

  • Introduces Discrete Diffusion Divergence Instruct (DiDi-Instruct), a training-based method that distills a few-step student from a pre-trained diffusion LLM to enable fast generation.
  • Builds on integral KL-divergence minimization and adds grouped reward normalization, intermediate-state matching, and a reward-guided ancestral sampler to improve training stability, model coverage, and inference quality.
  • Reports that the distilled model matches or surpasses its diffusion teacher and the GPT-2 baseline while delivering up to 64x acceleration, and that it reduces additional training wall-clock time by more than 20x compared with competing dLLM distillation methods.
  • On OpenWebText, it reports perplexity ranging from 62.2 (8 NFEs, i.e. number of function evaluations) to 18.4 (128 NFEs), outperforming prior accelerated dLLMs and the GPT-2 baseline with an entropy loss of around 1%. Robustness is further checked through ablation studies, model scaling, downstream task evaluations, and unconditional protein sequence generation.
  • Overall, the work argues that DiDi-Instruct enables efficient and effective distillation for language generation with practical impact on speed and resource use.
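
The key points mention grouped reward normalization without giving its formula. As a rough illustration only, the sketch below assumes it means standardizing rewards within fixed-size groups of samples (subtract the group mean, divide by the group standard deviation). The function name `normalize_rewards_by_group`, the `group_size` and `eps` parameters, and the sample rewards are hypothetical and are not taken from the paper.

```python
import math
from typing import List


def normalize_rewards_by_group(
    rewards: List[float], group_size: int, eps: float = 1e-8
) -> List[float]:
    """Standardize rewards within consecutive groups of `group_size` samples.

    Assumption: "grouped reward normalization" is read here as per-group
    zero-mean, unit-variance scaling. The paper may define it differently.
    """
    if group_size <= 0 or len(rewards) % group_size != 0:
        raise ValueError("len(rewards) must be a positive multiple of group_size")

    normalized: List[float] = []
    for start in range(0, len(rewards), group_size):
        group = rewards[start:start + group_size]
        mean = sum(group) / group_size
        # Population variance of the group (divides by group_size).
        variance = sum((r - mean) ** 2 for r in group) / group_size
        std = math.sqrt(variance)
        # eps keeps the division finite when all rewards in a group are equal.
        normalized.extend((r - mean) / (std + eps) for r in group)
    return normalized


# Hypothetical rewards: two groups on very different scales.
example = normalize_rewards_by_group([1.0, 3.0, 10.0, 30.0], group_size=2)
print(example)
```

In this toy input both groups end up on the same scale after normalization, which is the kind of effect that could help keep a reward signal stable across batches with different raw reward magnitudes.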

Abstract

Fast and high-quality language generation is the holy grail that people pursue in the age of AI. In this work, we introduce Discrete Diffusion Divergence Instruct (DiDi-Instruct), a training-based method that initializes from a pre-trained diffusion large language model (dLLM) and distills a few-step student for fast generation. The model distilled with DiDi-Instruct matches or surpasses its dLLM teacher and the GPT-2 baseline while providing up to 64× acceleration. The theoretical foundation of DiDi-Instruct is a novel framework based on integral KL-divergence minimization, which leads to a practical training algorithm. We further introduce grouped reward normalization, intermediate-state matching, and the reward-guided ancestral sampler to improve training stability, model coverage, and inference quality. On the OpenWebText benchmark, DiDi-Instruct achieves perplexity ranging from 62.2 (8 NFEs) to 18.4 (128 NFEs), outperforming prior accelerated dLLMs and the GPT-2 baseline. These gains incur a negligible entropy loss (around 1%) and reduce additional training wall-clock time by more than 20× compared to competing dLLM distillation methods. We further validate the robustness and effectiveness of DiDi-Instruct through extensive ablation studies, model scaling, downstream task evaluations, and unconditional protein sequence generation. In conclusion, DiDi-Instruct enables efficient and effective distillation for language generation in the blink of an eye.
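
Perplexity, the metric quoted in the abstract, is conventionally the exponential of the average negative log-likelihood per token. The sketch below shows that standard calculation, assuming per-token natural-log probabilities are already available from some scoring model. The function name `perplexity` and the toy inputs are illustrative, and the abstract does not spell out the paper's exact evaluation protocol.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Return exp(-mean(log p)) over the given per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy check: if every token has probability 0.25, the perplexity is close to 4,
# i.e. the model is as uncertain as a uniform choice among four tokens.
uniform_four = [math.log(0.25)] * 8
print(perplexity(uniform_four))
```

Lower values mean the scoring model finds the generated text more predictable, which is why the drop from 62.2 to 18.4 as NFEs increase is read as a quality improvement.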