Neural Continuous-Time Markov Chain: Discrete Diffusion via Decoupled Jump Timing and Direction

arXiv cs.LG / 4/20/2026


Key Points

  • The paper argues that many discrete diffusion approaches based on continuous-time Markov chains (CTMCs) parameterize the reverse dynamics as one monolithic object, instead of matching CTMC’s core decomposition into jump timing and jump direction.
  • It proposes “Neural CTMC,” which learns the reverse process with two separate network heads: an exit rate for when to jump and a jump distribution for where to jump.
  • The authors show the training objective can be expressed so that, up to a θ-independent constant, it is determined by the exit-rate and jump-distribution terms, and that the KL divergence cleanly factorizes into a Poisson KL (timing) and a categorical KL (direction).
  • They prove the use of a tractable conditional surrogate preserves the gradients and minimizers of the marginal reverse-process objective under standard assumptions, and extend the theory to masked and GIDD-style noise schedules.
  • The authors report that their “pure-uniform” method is the first to beat mask-based methods on OpenWebText, and they release pretrained weights on Hugging Face for reproducibility.
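The decoupling in the second bullet can be illustrated with a toy reconstruction of a CTMC generator from the two heads' outputs. The function and variable names below are illustrative, not from the paper; the identity used is the standard one that off-diagonal rates equal the exit rate times the jump probability.

```python
def generator_from_heads(exit_rate, jump_dist):
    """Rebuild a CTMC generator (rate matrix) from the two decoupled
    quantities: exit_rate[x] is the total rate of leaving state x
    (when to jump), and jump_dist[x][y] is the probability that a jump
    from x lands in y, y != x (where to jump).
    Off-diagonals are lambda(x) * p(y|x); diagonals make rows sum to 0."""
    S = len(exit_rate)
    R = [[exit_rate[x] * jump_dist[x][y] if y != x else 0.0
          for y in range(S)] for x in range(S)]
    for x in range(S):
        R[x][x] = -sum(R[x])  # diagonal = -(total exit rate)
    return R

# Toy 3-state chain: uniform jumps to the other two states.
lam = [1.0, 2.0, 0.5]
p = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
R = generator_from_heads(lam, p)
assert all(abs(sum(row)) < 1e-12 for row in R)  # valid generator rows
assert all(abs(-R[x][x] - lam[x]) < 1e-12 for x in range(3))
```

Any pair (exit rate, jump distribution) that passes these checks determines a valid CTMC, which is why the paper can train the two heads separately and still recover well-defined reverse dynamics.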

Abstract

Discrete diffusion models based on continuous-time Markov chains (CTMCs) have shown strong performance on language and discrete data generation, yet existing approaches typically parameterize the reverse rate matrix as a single object -- via concrete scores, clean-data predictions (x_0-parameterization), or denoising distributions -- rather than aligning the parameterization with the intrinsic CTMC decomposition into jump timing and jump direction. Since a CTMC is fundamentally a Poisson process fully determined by these two quantities, decomposing along this structure is closer to first principles and naturally leads to our formulation. We propose **Neural CTMC**, which separately parameterizes the reverse process through an *exit rate* (when to jump) and a *jump distribution* (where to jump) using two dedicated network heads. We show that the evidence lower bound (ELBO) differs from a path-space KL divergence between the true and learned reverse processes by a θ-independent constant, so that the training objective is fully governed by the exit rate and jump distribution we parameterize. Moreover, this KL factorizes into a Poisson KL for timing and a categorical KL for direction. We further show that the tractable conditional surrogate preserves the gradients and minimizers of the corresponding marginal reverse-process objective under standard regularity assumptions. Our theoretical framework also covers masked and GIDD-style noise schedules. Empirically, while the uniform forward process has been explored in prior work, our model is, to the best of our knowledge, the first pure-uniform method to outperform mask-based methods on the OpenWebText dataset. To facilitate reproducibility, we release our pretrained weights at https://huggingface.co/Jiangxy1117/Neural-CTMC.
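The factorization claimed in the abstract (a Poisson KL for timing plus a rate-weighted categorical KL for direction) can be checked numerically at a single state. The instantaneous form below is the standard path-space KL rate between two CTMCs; the function names are illustrative, not the paper's.

```python
import math

def poisson_kl(lam, lam_hat):
    """KL rate between Poisson clocks with intensities lam and lam_hat."""
    return lam * math.log(lam / lam_hat) - lam + lam_hat

def categorical_kl(p, p_hat):
    """KL between categorical jump distributions p and p_hat."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, p_hat) if pi > 0)

def ctmc_kl_rate(lam, p, lam_hat, p_hat):
    """Instantaneous path-KL rate at one state, computed two ways:
    (a) directly from the per-transition rates r(y) = lam * p(y),
    (b) factorized as Poisson KL (timing) + lam * categorical KL (direction)."""
    direct = sum(lam * py * math.log((lam * py) / (lam_hat * qy))
                 for py, qy in zip(p, p_hat) if py > 0) - lam + lam_hat
    factorized = poisson_kl(lam, lam_hat) + lam * categorical_kl(p, p_hat)
    return direct, factorized

direct, factorized = ctmc_kl_rate(1.5, [0.2, 0.8], 1.0, [0.5, 0.5])
assert math.isclose(direct, factorized)  # the two forms agree
```

The agreement follows from expanding log(lam*p / (lam_hat*p_hat)) into log(lam/lam_hat) + log(p/p_hat), which is exactly why the objective splits cleanly across the two network heads.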