Thinking into the Future: Latent Lookahead Training for Transformers

arXiv cs.CL / 3/24/2026


Key Points

  • The paper argues that next-token autoregressive training makes models commit to a single continuation at every step and allocates equal compute per token, which can limit exploration and expressiveness.
  • It introduces “latent lookahead,” a training strategy where the model performs multi-step lookahead in latent space (recursively feeding hidden states) at selected sequence positions rather than sampling discrete future tokens.
  • The method supervises intermediate latent predictions against the next ground-truth tokens, explicitly encouraging the model to “think ahead” and refine what it will generate next.
  • Experiments show latent lookahead substantially improves performance over both autoregressive and non-autoregressive baselines on planning-heavy tasks including maze solving, Sudoku, and ProsQA, where foresight matters.

Abstract

Autoregressive language models trained with next-token prediction generate text by sampling one discrete token at a time. Although highly scalable, this objective forces the model to commit at every step, preventing it from exploring or reflecting on multiple plausible continuations. Furthermore, compute allocation across tokens is uniform: every token is produced by a single forward pass, potentially limiting the model's expressiveness when difficult tokens inherently require more compute. Towards addressing these limitations, we introduce latent lookahead, a training strategy that enables models to "think" before generating: at selected positions in the sequence, before committing to the next token, the model performs a multi-step lookahead in latent space. More precisely, instead of sampling future tokens, we leverage the network's latent space by recursively feeding its hidden states back into the context for τ steps, investing more compute in predicting that token. This produces τ latent predictions that are supervised against the next τ ground-truth tokens, encouraging the model to look ahead and refine its prediction. We show that latent lookahead substantially outperforms both autoregressive and non-autoregressive baselines on planning tasks such as maze solving, Sudoku, and ProsQA, where foresight is essential.
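To make the objective concrete, here is a minimal NumPy sketch of the latent-lookahead loss at a single position. All weights, the mean-pooled "encoder," and the tanh recurrence are toy stand-ins chosen for illustration, not the paper's architecture; only the core idea is preserved: the hidden state is fed back for τ steps, and each resulting latent prediction is supervised with cross-entropy against the corresponding ground-truth future token.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, TAU = 10, 8, 3                          # vocab size, hidden dim, lookahead depth

# Hypothetical toy parameters (random, untrained).
W_emb = rng.normal(size=(V, d))               # token embeddings
W_rec = rng.normal(size=(d, d)) / np.sqrt(d)  # latent-to-latent recurrence
W_out = rng.normal(size=(d, V))               # latent -> vocab logits

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def latent_lookahead_loss(context_ids, future_ids):
    """Toy latent lookahead at one position: recursively feed the hidden
    state back for TAU steps; supervise each latent prediction against
    the next TAU ground-truth tokens with cross-entropy."""
    # Stand-in "forward pass" over the context: mean of token embeddings.
    h = W_emb[context_ids].mean(axis=0)
    loss = 0.0
    for t in range(TAU):
        h = np.tanh(h @ W_rec)                   # feed latent state back in
        p = softmax(h @ W_out)                   # latent prediction over vocab
        loss += -np.log(p[future_ids[t]] + 1e-12)  # match t-th future token
    return loss / TAU

ctx = np.array([1, 4, 2])       # context token ids
fut = np.array([7, 3, 5])       # next TAU ground-truth tokens
loss = latent_lookahead_loss(ctx, fut)
```

In a real transformer the recurrence would reinsert the hidden state into the context (so later lookahead steps attend to earlier ones), and the τ losses would be added to the standard next-token objective during training.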