Variational Autoencoding Discrete Diffusion with Enhanced Dimensional Correlations Modeling

arXiv stat.ML / 4/15/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper introduces Variational Autoencoding Discrete Diffusion (VADD), a framework that combines discrete diffusion (in the style of masked diffusion models) with latent-variable modeling to better represent dependencies across multiple dimensions.
  • It addresses a key limitation of masked diffusion models: performance can drop when using only a few denoising steps because inter-dimensional correlations are not modeled well.
  • VADD adds an auxiliary recognition model and trains by maximizing a variational lower bound, enabling more stable training and amortized inference over the training set.
  • Experiments across 2D toy data, pixel-level image generation, and text generation show VADD improves sample quality compared with MDM baselines, particularly under low denoising-step budgets.
  • The authors claim VADD preserves the generation efficiency benefits of traditional MDMs while substantially raising output quality when denoising steps are limited.
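The few-step failure mode described above can be illustrated with a toy experiment (my own illustrative sketch, not from the paper): take data over two perfectly correlated binary dimensions (x1 always equals x2). A factorized denoiser that unmasks both dimensions in a single step can only predict each dimension's marginal, P(x1=1) = P(x2=1) = 0.5, so it produces mismatched pairs about half the time:

```python
import random

random.seed(0)

# Toy data: two perfectly correlated binary dimensions, x1 == x2 always.
# A factorized one-step denoiser predicts each dimension's marginal
# independently, so it cannot enforce the correlation within a single step.

def one_step_factorized_sample():
    # Both dimensions are unmasked in the same step, independently.
    x1 = random.random() < 0.5
    x2 = random.random() < 0.5
    return x1, x2

samples = [one_step_factorized_sample() for _ in range(10_000)]
match_rate = sum(a == b for a, b in samples) / len(samples)
print(f"x1 == x2 in {match_rate:.0%} of one-step samples")  # ~50%, not 100%
```

With many denoising steps an MDM can recover the correlation (unmask x1 first, then condition on it when unmasking x2); VADD's latent variable is meant to capture such dependencies even when the step budget does not allow that.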

Abstract

Discrete diffusion models have recently shown great promise for modeling complex discrete data, with masked diffusion models (MDMs) offering a compelling trade-off between quality and generation speed. MDMs denoise by progressively unmasking multiple dimensions from an all-masked input, but their performance can degrade when using few denoising steps due to limited modeling of inter-dimensional dependencies. In this paper, we propose Variational Autoencoding Discrete Diffusion (VADD), a novel framework that enhances discrete diffusion with latent variable modeling to implicitly capture correlations among dimensions. By introducing an auxiliary recognition model, VADD enables stable training via variational lower bounds maximization and amortized inference over the training set. Our approach retains the efficiency of traditional MDMs while significantly improving sample quality, especially when the number of denoising steps is small. Empirical results on 2D toy data, pixel-level image generation, and text generation demonstrate that VADD consistently outperforms MDM baselines in sample quality with few denoising steps.
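The training scheme the abstract describes can be sketched as follows. This is a hypothetical minimal PyTorch illustration, not the paper's architecture: all module names, the Gaussian latent, the mean-pooled encoder, and the fixed masking rate are my assumptions. A recognition model q(z|x) encodes the clean sequence into a latent z, a denoiser p(x|x_masked, z) predicts masked tokens conditioned on z, and the loss is a variational lower bound: masked-token cross-entropy plus KL(q(z|x) || N(0, I)).

```python
import torch
import torch.nn as nn

V, L_SEQ, D, MASK = 16, 8, 32, 0  # vocab size, seq length, latent dim, mask token id

class Recognition(nn.Module):          # q(z | x): hypothetical encoder
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(V, D)
        self.head = nn.Linear(D, 2 * D)  # -> (mu, log_var)
    def forward(self, x):
        h = self.emb(x).mean(dim=1)      # mean-pool over positions (assumption)
        mu, log_var = self.head(h).chunk(2, dim=-1)
        return mu, log_var

class Denoiser(nn.Module):             # p(x | x_masked, z): hypothetical denoiser
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(V, D)
        self.proj_z = nn.Linear(D, D)
        self.out = nn.Linear(D, V)
    def forward(self, x_masked, z):
        # Condition every position on the shared latent z.
        h = self.emb(x_masked) + self.proj_z(z).unsqueeze(1)
        return self.out(h)             # per-position logits over the vocabulary

def elbo_loss(x, q_net, p_net, mask_prob=0.5):
    mu, log_var = q_net(x)
    z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()   # reparameterization
    mask = torch.rand_like(x, dtype=torch.float) < mask_prob
    x_masked = torch.where(mask, torch.full_like(x, MASK), x)
    logits = p_net(x_masked, z)
    recon = nn.functional.cross_entropy(logits[mask], x[mask])  # masked tokens only
    kl = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum(-1).mean()
    return recon + kl                  # negative ELBO (up to constants)

torch.manual_seed(0)
x = torch.randint(1, V, (4, L_SEQ))    # a batch of token sequences
loss = elbo_loss(x, Recognition(), Denoiser())
print(float(loss))
```

At sampling time one would draw z from the prior once and condition every denoising step on it, which is how the latent can carry cross-dimensional information that a purely factorized per-step denoiser cannot.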