Not All Denoising Steps Are Equal: Model Scheduling for Faster Masked Diffusion Language Models

arXiv cs.LG / 2026/4/6

Key Points

The paper studies masked diffusion language models (MDLMs), focusing on speeding up sampling that currently requires many full-sequence denoising passes through a large Transformer.
It proposes “model scheduling,” using a smaller MDLM to replace the full model at selected denoising steps to reduce compute while preserving quality.
Experiments on OpenWebText show early and late denoising steps are more robust to small-model replacement than middle steps, enabling up to a 17% FLOPs reduction with only modest loss in generative perplexity.
The authors back these results with step-importance analyses (loss and KL divergence across timesteps) and an exhaustive search over coarse step segments, concluding the middle of the diffusion trajectory is most sensitive.
Overall, the work suggests architecture-agnostic scheduling rules can accelerate MDLM inference without substantially harming generation quality as measured by perplexity.
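
To make the scheduling idea concrete, here is a minimal sketch of one rule consistent with the finding above: give the earliest and latest denoising steps to the small model and keep the large model for the middle. The function names, the even head/tail split, and the `small_cost_ratio` parameter are illustrative assumptions, not the authors' implementation, and the numbers in the usage note are toy values rather than results from the paper.

```python
from typing import List


def sandwich_schedule(num_steps: int, small_fraction: float) -> List[str]:
    """Return a per-step model choice ("small" or "large").

    Hypothetical rule: the small model handles the first and last steps,
    and the large model handles the middle of the trajectory.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    if not 0.0 <= small_fraction <= 1.0:
        raise ValueError("small_fraction must be in [0, 1]")

    n_small = round(num_steps * small_fraction)
    head = n_small // 2          # small-model steps at the start
    tail = n_small - head        # small-model steps at the end
    return [
        "small" if (i < head or i >= num_steps - tail) else "large"
        for i in range(num_steps)
    ]


def relative_flops(schedule: List[str], small_cost_ratio: float) -> float:
    """Cost of a schedule relative to using the large model at every step.

    small_cost_ratio is the assumed per-step cost of the small model
    divided by the per-step cost of the large model.
    """
    costs = [small_cost_ratio if m == "small" else 1.0 for m in schedule]
    return sum(costs) / len(costs)


if __name__ == "__main__":
    schedule = sandwich_schedule(num_steps=10, small_fraction=0.4)
    print(schedule)
    print(relative_flops(schedule, small_cost_ratio=0.5))
```

With these toy settings, four of ten steps go to the small model (two at each end), and the schedule costs 80% of the all-large baseline. The achievable saving in practice depends on the real cost ratio between the two models and on how many steps can be swapped before quality degrades.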

Abstract

Recent advances in masked diffusion language models (MDLMs) narrow the quality gap to autoregressive LMs, but their sampling remains expensive because generation requires many full-sequence denoising passes with a large Transformer and, unlike autoregressive decoding, cannot benefit from KV caching. In this work, we exploit the flexibility of the diffusion framework and study model scheduling, where a smaller MDLM replaces the full model at a subset of denoising steps. On OpenWebText, we show that early and late denoising steps are substantially more robust to such replacement than middle steps, enabling up to a 17% reduction in FLOPs with only modest degradation in generative perplexity. We support these findings with a step-importance analysis based on loss and KL divergence between small and large models across timesteps, as well as an exhaustive search over coarse step segments, both of which identify the middle of the diffusion trajectory as most sensitive. Our results suggest that simple, architecture-agnostic scheduling rules can significantly accelerate MDLM sampling while largely preserving generation quality as measured by generative perplexity.
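
The step-importance analysis mentioned in the abstract can be illustrated with a small sketch that ranks timesteps by the KL divergence between the large and small models' predicted token distributions. The direction of the divergence (large relative to small), the one-distribution-per-step simplification, and all numbers below are assumptions made for illustration; the abstract does not specify these details.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two categorical distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def rank_steps_by_sensitivity(
    large_probs: Sequence[Sequence[float]],
    small_probs: Sequence[Sequence[float]],
) -> List[int]:
    """Return step indices sorted from most to least divergent.

    A higher KL at a step suggests that swapping in the small model there
    changes the prediction more, so that step is a worse candidate for replacement.
    """
    scores = [kl_divergence(p, q) for p, q in zip(large_probs, small_probs)]
    return sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)


if __name__ == "__main__":
    # Toy distributions over a two-token vocabulary at three timesteps.
    large = [[0.5, 0.5], [0.9, 0.1], [0.6, 0.4]]
    small = [[0.5, 0.5], [0.5, 0.5], [0.55, 0.45]]
    print(rank_steps_by_sensitivity(large, small))
```

In this toy example the middle step has the largest divergence and is ranked first, which is the kind of pattern the abstract describes. A real analysis would average the divergence over masked positions and many sequences at each timestep.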
