Think First, Diffuse Fast: Improving Diffusion Language Model Reasoning via Autoregressive Plan Conditioning
arXiv cs.AI / 3/17/2026
Key Points
- Plan conditioning prepends a ~100-token natural-language plan from an autoregressive model to the diffusion model's prompt, creating a frozen, globally visible scaffold that every token position can attend to from the first denoising step.
- On GSM8K, LLaDA-8B-Instruct improves from 75.6% to 87.2%, nearly matching LLaMA 3.1 8B (87.7%); on HumanEval, accuracy rises from 37.2% to 50.0%, showing the method generalizes from math to code.
- Diffusion models gain 2-10x more from plan conditioning than autoregressive baselines, supporting the coordination-problem hypothesis. Ablations show wrong plans hurt (-16.3pp) while perturbing plan values has a small effect (-1.1pp), and plan quality has a sharp threshold for effectiveness.
- Attention analysis reveals plan tokens receive 1.8x excess attention early in denoising, which normalizes as completion tokens solidify.
- Plan conditioning costs about $0.002 per problem and ~2 seconds of added latency; across five random seeds for GSM8K, accuracy shows zero standard deviation, indicating highly stable diffusion inference.
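The two-stage pipeline in the first key point can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `generate_plan` and `diffusion_generate` are hypothetical stand-ins for the autoregressive planner and the diffusion LM (e.g. LLaDA), stubbed here so the control flow is runnable.

```python
def generate_plan(problem: str, max_plan_tokens: int = 100) -> str:
    """Autoregressive stage: produce a short natural-language plan (~100 tokens)."""
    # Stub: a real implementation would sample greedily or with low temperature
    # from an autoregressive model.
    return "Step 1: identify the quantities. Step 2: set up the equation. Step 3: solve."

def diffusion_generate(prompt: str) -> str:
    """Diffusion stage: denoise all answer tokens in parallel.

    Because the plan sits in the (frozen) prompt, every answer position can
    attend to it from the very first denoising step.
    """
    # Stub: a real implementation would run iterative denoising over masked tokens.
    return f"[answer conditioned on {len(prompt.split())} prompt tokens]"

def plan_conditioned_answer(problem: str) -> str:
    # 1) Obtain a plan from the autoregressive model; it is then held fixed.
    plan = generate_plan(problem)
    # 2) Prepend the plan to the prompt so it acts as a globally visible scaffold.
    prompt = f"Plan: {plan}\nProblem: {problem}\nAnswer:"
    return diffusion_generate(prompt)

print(plan_conditioned_answer("A train travels 60 miles in 1.5 hours. What is its speed?"))
```

The key design point is that the plan is generated once, left-to-right, and never revised by the diffusion model; only the answer tokens are denoised.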
