Think First, Diffuse Fast: Improving Diffusion Language Model Reasoning via Autoregressive Plan Conditioning
arXiv cs.AI / 3/17/2026
Key Points
- Plan conditioning prepends a ~100-token natural-language plan from an autoregressive model to the diffusion model's prompt, creating a frozen, globally visible scaffold that every token position can attend to from the first denoising step.
- On GSM8K, LLaDA-8B-Instruct improves from 75.6% to 87.2% (+11.6pp), matching LLaMA 3.1 8B (87.7%); the gain is substantially larger than the autoregressive baseline's. On HumanEval, accuracy rises from 37.2% to 50.0%, showing the method generalizes to code.
- Diffusion models gain 2-10x more from plan conditioning than autoregressive baselines, supporting the coordination-problem hypothesis. Ablations show that wrong plans hurt (-16.3pp) while perturbing plan values has only a small effect (-1.1pp), and that plan quality has a sharp threshold for effectiveness.
- Attention analysis reveals plan tokens receive 1.8x excess attention early in denoising, which normalizes as completion tokens solidify.
- Plan conditioning costs about $0.002 per problem and ~2 seconds of added latency; across five random seeds for GSM8K, accuracy shows zero standard deviation, indicating highly stable diffusion inference.
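The pipeline described above can be sketched in a few lines. This is a hedged illustration only: `ar_generate` and `diffusion_generate` are hypothetical stand-ins (the paper's actual models are an autoregressive planner and LLaDA-8B-Instruct, whose real APIs are not shown here), and the stubs exist solely to make the data flow concrete.

```python
# Sketch of plan conditioning, under assumed interfaces.
# `ar_generate` stands in for an autoregressive planner; `diffusion_generate`
# stands in for a masked-diffusion LM (e.g. LLaDA). Neither is a real API.

def ar_generate(prompt: str, max_tokens: int = 100) -> str:
    """Stub: an AR model would emit a ~100-token natural-language plan here."""
    return "Plan: 1) identify quantities 2) set up the equation 3) solve it."

def diffusion_generate(conditioning: str, num_steps: int = 8) -> str:
    """Stub: a diffusion LM denoises all completion tokens in parallel,
    attending to the frozen conditioning (prompt + plan) at every step."""
    return f"<completion conditioned on {len(conditioning)} chars>"

def plan_conditioned_generate(problem: str) -> str:
    # Step 1: ask the AR model for a brief solution plan.
    plan = ar_generate(f"Write a brief solution plan for: {problem}")
    # Step 2: prepend the plan to the prompt. The plan is frozen (never
    # re-noised), so every completion position can attend to it from the
    # first denoising step -- the "globally visible scaffold".
    conditioning = f"{problem}\n{plan}\n"
    # Step 3: run the diffusion model on the plan-augmented prompt.
    return diffusion_generate(conditioning)

print(plan_conditioned_generate("A train travels 60 miles in 1.5 hours..."))
```

The per-problem cost quoted in the article (~$0.002, ~2 s of latency) comes entirely from Step 1, the single short AR call; Steps 2 and 3 reuse the diffusion model's ordinary inference path unchanged.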
Related Articles
How political censorship actually works inside Qwen, DeepSeek, GLM, and Yi: Ablation and behavioral results across 9 models
Reddit r/LocalLLaMA

OpenSeeker's open-source approach aims to break up the data monopoly for AI search agents
THE DECODER

How to Choose the Best AI Chat Models of 2026 for Your Business Needs
Dev.to

I built an AI that generates lesson plans in your exact teaching voice (open source)
Dev.to

6-Band Prompt Decomposition: The Complete Technical Guide
Dev.to