Think First, Diffuse Fast: Improving Diffusion Language Model Reasoning via Autoregressive Plan Conditioning
arXiv cs.AI / 3/17/2026
Key Points
- Plan conditioning prepends a ~100-token natural-language plan from an autoregressive model to the diffusion model's prompt, creating a frozen, globally visible scaffold that every token position can attend to from the first denoising step.
- On GSM8K, LLaDA-8B-Instruct improves from 75.6% to 87.2% (+11.6pp), matching LLaMA 3.1 8B (87.7%); the gain is substantially larger than the autoregressive baseline's. On HumanEval, accuracy rises from 37.2% to 50.0%, showing the method generalizes to code.
- Diffusion models gain 2-10x more from plan conditioning than autoregressive baselines, supporting the coordination-problem hypothesis. Ablations show that wrong plans hurt (-16.3pp) while perturbing plan values has only a small effect (-1.1pp), and that plan quality has a sharp threshold for effectiveness.
- Attention analysis reveals plan tokens receive 1.8x excess attention early in denoising, which normalizes as completion tokens solidify.
- Plan conditioning costs about $0.002 per problem and ~2 seconds of added latency; across five random seeds for GSM8K, accuracy shows zero standard deviation, indicating highly stable diffusion inference.
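The pipeline described above can be sketched in a few lines. This is a hedged illustration only: `ar_generate` and `diffusion_generate` are hypothetical stand-ins (the paper's actual models are an autoregressive planner and LLaDA-8B-Instruct, whose real APIs are not shown here), and the stubs exist solely to make the data flow concrete.

```python
# Sketch of plan conditioning, under assumed interfaces.
# `ar_generate` stands in for an autoregressive planner; `diffusion_generate`
# stands in for a masked-diffusion LM (e.g. LLaDA). Neither is a real API.

def ar_generate(prompt: str, max_tokens: int = 100) -> str:
    """Stub: an AR model would emit a ~100-token natural-language plan here."""
    return "Plan: 1) identify quantities 2) set up the equation 3) solve it."

def diffusion_generate(conditioning: str, num_steps: int = 8) -> str:
    """Stub: a diffusion LM denoises all completion tokens in parallel,
    attending to the frozen conditioning (prompt + plan) at every step."""
    return f"<completion conditioned on {len(conditioning)} chars>"

def plan_conditioned_generate(problem: str) -> str:
    # Step 1: ask the AR model for a brief solution plan.
    plan = ar_generate(f"Write a brief solution plan for: {problem}")
    # Step 2: prepend the plan to the prompt. The plan is frozen (never
    # re-noised), so every completion position can attend to it from the
    # first denoising step -- the "globally visible scaffold".
    conditioning = f"{problem}\n{plan}\n"
    # Step 3: run the diffusion model on the plan-augmented prompt.
    return diffusion_generate(conditioning)

print(plan_conditioned_generate("A train travels 60 miles in 1.5 hours..."))
```

The per-problem cost quoted in the article (~$0.002, ~2 s of latency) comes entirely from Step 1, the single short AR call; Steps 2 and 3 reuse the diffusion model's ordinary inference path unchanged.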
Related Articles
How political censorship actually works inside Qwen, DeepSeek, GLM, and Yi: Ablation and behavioral results across 9 models
Reddit r/LocalLLaMA

OpenSeeker's open-source approach aims to break up the data monopoly for AI search agents
THE DECODER

How to Choose the Best AI Chat Models of 2026 for Your Business Needs
Dev.to

I built an AI that generates lesson plans in your exact teaching voice (open source)
Dev.to

6-Band Prompt Decomposition: The Complete Technical Guide
Dev.to