Beyond Masks: Efficient, Flexible Diffusion Language Models via Deletion-Insertion Processes

arXiv cs.LG / 3/26/2026


Key Points

  • The paper introduces Deletion-Insertion Diffusion language models (DID) that reformulate token deletion and insertion as discrete diffusion processes to replace token masking/unmasking in masked diffusion language models (MDLMs).
  • DID aims to improve computational efficiency by removing overhead from non-informative <MASK> token computations and from <PAD> token handling in variable-length generation.
  • The approach is designed to natively support variable-length sequences without fixed-length padding, and it adds an intrinsic self-correction capability during generation via insertion operations that adjust token positions.
  • Training uses a score-based approach that assigns scores to token insertion operations; the resulting objectives reduce to subsequence-counting problems, solved with a parallelized dynamic programming algorithm.
  • Experiments in both fixed- and variable-length settings report better modeling performance, better sampling quality, and faster training/inference than MDLM baselines and existing insertion-based language models, without hyperparameter tuning.
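The paper's exact training objectives are not reproduced here, but the subsequence-counting subproblem it mentions is presumably the classic one: count how many times one sequence occurs as a (not necessarily contiguous) subsequence of another. A minimal single-pattern sketch of that dynamic program follows (the function name is mine, not the paper's; the paper additionally parallelizes this computation):

```python
def count_subsequence_occurrences(sub, seq):
    """Count the ways `sub` appears as a subsequence of `seq` via a 1-D DP.

    dp[j] = number of ways sub[:j] matches a subsequence of the
    prefix of `seq` consumed so far; dp[0] = 1 (empty match).
    """
    dp = [0] * (len(sub) + 1)
    dp[0] = 1
    for token in seq:
        # Sweep j in descending order so this token extends each
        # partial match at most once.
        for j in range(len(sub), 0, -1):
            if sub[j - 1] == token:
                dp[j] += dp[j - 1]
    return dp[len(sub)]
```

For example, `count_subsequence_occurrences("abc", "abcabc")` is 4, since there are four index triples picking out "abc". The inner loop over `j` is what a parallelized variant would vectorize across pattern positions or batch elements.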

Abstract

While Masked Diffusion Language Models (MDLMs) relying on token masking and unmasking have shown promise in language modeling, their computational efficiency and generation flexibility remain constrained by the masking paradigm. In this paper, we propose Deletion-Insertion Diffusion language models (DID), which rigorously formulate token deletion and insertion as discrete diffusion processes, replacing the masking and unmasking processes in current MDLMs. DID improves training and inference efficiency by eliminating two major sources of computational overhead in MDLMs: computations on 1) non-informative <MASK> tokens inherent to the paradigm, and 2) <PAD> tokens introduced in variable-length settings. Furthermore, DID offers greater flexibility by 1) natively supporting variable-length sequences without requiring fixed-length padding, and 2) providing an intrinsic self-correction mechanism during generation, since insertion dynamically adjusts token positions. To train DID, we design a score-based approach that assigns scores to token insertion operations and derive appropriate training objectives. These objectives involve subsequence-counting problems, which we solve efficiently via a parallelized dynamic programming algorithm. Our experiments across fixed- and variable-length settings demonstrate the advantage of DID over MDLM baselines and existing insertion-based LMs in terms of modeling performance, sampling quality, and training/inference speed, without any hyperparameter tuning.
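To make the deletion-as-forward-process idea concrete, here is a toy corruption trajectory: tokens are removed one at a time until the sequence is empty, and the reverse (generative) direction would insert tokens back by scoring insertion operations. This is only an illustration of the shape of the process; the paper defines the actual diffusion kernel, schedule, and scoring rigorously, and none of these function names come from it.

```python
import random


def delete_one_step(seq, rng):
    """Remove one uniformly chosen token (toy stand-in for a deletion kernel)."""
    i = rng.randrange(len(seq))
    return seq[:i] + seq[i + 1:]


def forward_deletion_trajectory(tokens, rng):
    """Corrupt a token tuple all the way down to the empty sequence.

    A DID-style model would learn the reverse direction: at each step,
    score candidate insertions (position, token) and apply one, so
    generation grows a variable-length sequence with no <MASK>/<PAD>
    tokens ever materialized.
    """
    traj = [tokens]
    while traj[-1]:
        traj.append(delete_one_step(traj[-1], rng))
    return traj
```

Running `forward_deletion_trajectory(("the", "cat", "sat"), random.Random(0))` yields a 4-element trajectory whose lengths shrink 3, 2, 1, 0; each intermediate state is an ordinary short sequence, which is where the claimed efficiency over mask-based corruption comes from.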