How Transformers Learn to Plan via Multi-Token Prediction
arXiv cs.AI · April 15, 2026
Key Points
- The paper argues that while next-token prediction (NTP) is the standard training objective for language models, it can miss the global structure needed for reasoning, motivating multi-token prediction (MTP) instead (a minimal loss sketch follows this list).
- Empirical results show MTP beats NTP on synthetic graph path-finding and on reasoning benchmarks including Countdown and Boolean satisfiability tasks.
- The authors give a theoretical analysis of a simplified two-layer Transformer, proving that MTP induces a two-stage reverse-reasoning behavior: the model first attends to the goal node, then reconstructs the intermediate path nodes backward from it (see the toy search sketch after this list).
- This reverse-planning effect is attributed to a gradient-decoupling property of MTP, which the authors present as yielding a cleaner, more effective training signal than NTP.
- Overall, the work suggests that multi-token training objectives can inherently bias optimization toward more robust and interpretable “reasoning circuits,” especially for planning-like tasks.
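
To make the objective concrete, below is a minimal PyTorch sketch of one common MTP instantiation: a shared causal trunk with K output heads, where position t predicts tokens t+1 through t+K and each offset contributes its own cross-entropy term. Everything here (`ToyMTPModel`, `mtp_loss`, the GRU standing in for a Transformer trunk, the toy sizes) is an illustrative assumption, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, K = 100, 64, 4  # toy sizes; K = number of future tokens predicted

class ToyMTPModel(nn.Module):
    """Shared causal trunk with K output heads, one per future-token offset."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        # A GRU stands in for the causal Transformer trunk to keep the sketch short.
        self.trunk = nn.GRU(DIM, DIM, batch_first=True)
        self.heads = nn.ModuleList(nn.Linear(DIM, VOCAB) for _ in range(K))

def mtp_loss(model, tokens):
    """Position t predicts tokens t+1 .. t+K; one cross-entropy term per offset."""
    hidden, _ = model.trunk(model.embed(tokens))         # (B, T, DIM)
    loss = 0.0
    for i, head in enumerate(model.heads):
        off = i + 1                                      # how far into the future
        logits = head(hidden[:, :-off])                  # (B, T-off, VOCAB)
        loss = loss + F.cross_entropy(
            logits.reshape(-1, VOCAB), tokens[:, off:].reshape(-1))
    return loss / K

tokens = torch.randint(0, VOCAB, (2, 16))                # (batch, seq_len)
print(mtp_loss(ToyMTPModel(), tokens))
```

NTP is recovered as the K = 1 special case; the paper's gradient-decoupling argument concerns how the extra per-offset loss terms shape the training signal relative to that baseline.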
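
The two-stage reverse reasoning described in the third point can be pictured with a toy, hand-written analogue: lock onto the goal node first, then search backward over reversed edges to reconstruct the path. This only illustrates the claimed behavior, not the paper's construction; `reverse_plan` and the adjacency-list format are hypothetical.

```python
from collections import deque

def reverse_plan(graph, start, goal):
    """Toy analogue of the claimed two-stage behavior: orient on the goal first,
    then reconstruct intermediate nodes by searching backward from it."""
    # Stage 1: orient on the goal by building the reversed edge map.
    rev = {}
    for u, vs in graph.items():
        for v in vs:
            rev.setdefault(v, []).append(u)
    # Stage 2: BFS backward from the goal; parent[p] is p's next hop toward goal.
    parent, frontier = {goal: None}, deque([goal])
    while frontier:
        node = frontier.popleft()
        if node == start:                  # reached the start: read off the path
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path                    # already in start -> goal order
        for p in rev.get(node, []):
            if p not in parent:
                parent[p] = node
                frontier.append(p)
    return None                            # no path exists

# On a small DAG, the path is recovered goal-first: prints [0, 1, 3].
print(reverse_plan({0: [1, 2], 1: [3], 2: [3]}, start=0, goal=3))
```

The point of the analogy is the order of resolution: intermediate nodes are determined goal-first, mirroring the order in which the paper's analysis says the trained model's attention resolves them.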