How Transformers Learn to Plan via Multi-Token Prediction

arXiv cs.AI / April 15, 2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that while next-token prediction (NTP) is common for language models, it can miss global structure needed for reasoning, motivating multi-token prediction (MTP) instead.
  • Empirical results show MTP beating NTP on synthetic graph path-finding and on reasoning benchmarks including Countdown and Boolean satisfiability tasks.
  • The authors provide a theoretical analysis using a simplified two-layer Transformer, proving that MTP leads to a two-stage reverse reasoning behavior: first attending to the end node, then reconstructing intermediate path nodes backward.
  • This reverse planning effect is attributed to a gradient-decoupling property of MTP, which is presented as giving a cleaner and more effective training signal than NTP.
  • Overall, the work suggests that multi-token training objectives can inherently bias optimization toward more robust and interpretable “reasoning circuits,” especially for planning-like tasks.
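To make the contrast between the two objectives concrete, here is a toy sketch of the NTP and MTP losses. The per-position head layout `probs[t][j]` and the target-offset convention are illustrative assumptions, not the paper's exact parameterization:

```python
import math

VOCAB = 4  # toy vocabulary size, for illustration only

def ntp_loss(probs, tokens):
    # Next-token prediction: probs[t][v] is the model's probability of
    # token v at position t; the single target is tokens[t + 1].
    terms = [-math.log(probs[t][tokens[t + 1]]) for t in range(len(tokens) - 1)]
    return sum(terms) / len(terms)

def mtp_loss(probs, tokens, k=2):
    # Multi-token prediction (one common formulation): k heads at each
    # position t jointly score the k future targets tokens[t+1..t+k].
    # probs[t][j][v] is head j's probability of token v at position t.
    terms = []
    for t in range(len(tokens) - k):
        for j in range(k):
            terms.append(-math.log(probs[t][j][tokens[t + j + 1]]))
    return sum(terms) / len(terms)

tokens = [0, 1, 2, 3, 1]
uniform = [1.0 / VOCAB] * VOCAB
ntp_probs = [uniform for _ in tokens]
mtp_probs = [[uniform, uniform] for _ in tokens]
print(round(ntp_loss(ntp_probs, tokens), 4))        # → 1.3863 (log 4, uniform predictions)
print(round(mtp_loss(mtp_probs, tokens, k=2), 4))   # → 1.3863
```

The point of the gradient-decoupling claim is that in the MTP sum each head receives its own supervision for a distinct future target, rather than all credit flowing through the single next-token distribution.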

Abstract

While next-token prediction (NTP) has been the standard objective for training language models, it often struggles to capture global structure in reasoning tasks. Multi-token prediction (MTP) has recently emerged as a promising alternative, yet its underlying mechanisms remain poorly understood. In this paper, we study how MTP facilitates reasoning, with a focus on planning. Empirically, we show that MTP consistently outperforms NTP on both synthetic graph path-finding tasks and more realistic reasoning benchmarks, such as Countdown and Boolean satisfiability problems. Theoretically, we analyze a simplified two-layer Transformer on a star graph task. We prove that MTP induces a two-stage reverse reasoning process: the model first attends to the end node and then reconstructs the path by tracing intermediate nodes backward. This behavior arises from a gradient decoupling property of MTP, which provides a cleaner training signal compared to NTP. Ultimately, our results highlight how multi-token objectives inherently bias optimization toward robust and interpretable reasoning circuits.
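The star-graph setup and the two-stage backward trace can be sketched as follows. The graph construction and the `solve_backward` helper are illustrative guesses at the task format, not the paper's code:

```python
import random
from collections import deque

def make_star_graph(num_arms=3, arm_len=4, seed=0):
    """Build a star graph: hub node 0 with num_arms disjoint paths of
    arm_len nodes radiating out (a plausible reading of the task)."""
    rng = random.Random(seed)
    edges, arms, node = [], [], 1
    for _ in range(num_arms):
        prev, arm = 0, []
        for _ in range(arm_len):
            edges.append((prev, node))
            arm.append(node)
            prev, node = node, node + 1
        arms.append(arm)
    rng.shuffle(edges)  # edges presented in random order, as in path-finding prompts
    return edges, arms

def solve_backward(edges, start, goal):
    """Mirror the two-stage 'reverse reasoning': orient on the goal first,
    then reconstruct the start-to-goal path from the backward trace."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    # Stage 1: BFS outward from the goal, recording for each node the
    # neighbor one step closer to the goal.
    toward_goal = {goal: None}
    q = deque([goal])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in toward_goal:
                toward_goal[w] = u
                q.append(w)
    # Stage 2: read the path off forward from the start.
    path, cur = [start], start
    while cur != goal:
        cur = toward_goal[cur]
        path.append(cur)
    return path

edges, arms = make_star_graph()
print(solve_backward(edges, 0, arms[0][-1]))  # → [0, 1, 2, 3, 4]
```

The backward pass is what makes the star graph discriminating: a purely forward greedy walk from the hub cannot tell the correct arm from the distractor arms, whereas tracing from the end node identifies the right arm immediately.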