Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers

arXiv cs.CL / 4/10/2026


Key Points

  • The paper investigates implicit reasoning in transformer models—how they combine rules or knowledge within a single forward pass—highlighting that standard transformers often fail at implicit multi-hop composition.
  • It proposes recurrent-depth transformers, which reuse the same transformer layers for iterative computation, and tests two compositional generalization settings: systematic generalization and depth extrapolation.
  • In controlled experiments with models trained from scratch, recurrent-depth transformers outperform vanilla transformers on both challenges, showing improved compositional generalization over parametric knowledge.
  • The authors find that systematic generalization emerges via a three-stage “grokking” process (moving from memorization to in-distribution generalization and then to systematic generalization), supported by mechanistic analysis.
  • For depth extrapolation, the study shows generalization to deeper hop counts can be enabled by increasing inference-time recurrence, but also identifies a key failure mode called “overthinking,” where excessive recurrence harms predictions.

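The core architectural idea in the key points, reusing one set of transformer-layer weights for a variable number of iterations, can be illustrated with a toy sketch. The block below is hypothetical: a simple residual nonlinear map stands in for the paper's full attention-plus-MLP transformer block, and `recurrent_depth_forward` simply shows how a weight-tied model lets compute depth be chosen at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_shared = rng.normal(scale=0.1, size=(d, d))  # ONE set of weights, reused every iteration

def block(h, W):
    # Toy stand-in for a transformer block: a residual nonlinear update.
    # (The paper's models use real attention + MLP layers.)
    return h + np.tanh(h @ W)

def recurrent_depth_forward(x, num_iters):
    # Recurrent-depth model: the same shared weights are applied
    # num_iters times, so depth is an inference-time knob rather than
    # a fixed stack of distinct layers.
    h = x
    for _ in range(num_iters):
        h = block(h, W_shared)
    return h

x = rng.normal(size=(1, d))
shallow = recurrent_depth_forward(x, num_iters=4)   # e.g. enough for few-hop inputs
deep = recurrent_depth_forward(x, num_iters=16)     # more recurrence for deeper hops
```

A vanilla transformer would instead allocate a distinct `W` per layer, fixing its depth at training time; weight tying is what makes "run it longer at inference" meaningful.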
Abstract

We study implicit reasoning, i.e. the ability to combine knowledge or rules within a single forward pass. While transformer-based large language models store substantial factual knowledge and rules, they often fail to compose this knowledge for implicit multi-hop reasoning, suggesting a lack of compositional generalization over their parametric knowledge. To address this limitation, we study recurrent-depth transformers, which enable iterative computation over the same transformer layers. We investigate two compositional generalization challenges under the implicit reasoning scenario: systematic generalization, i.e. combining knowledge that is never used for compositions during training, and depth extrapolation, i.e. generalizing from limited reasoning depth (e.g. training on up to 5-hop) to deeper compositions (e.g. 10-hop). Through controlled studies with models trained from scratch, we show that while vanilla transformers struggle with both generalization challenges, recurrent-depth transformers can effectively achieve such generalization. For systematic generalization, we find that this ability emerges through a three-stage grokking process, transitioning from memorization to in-distribution generalization and finally to systematic generalization, supported by mechanistic analysis. For depth extrapolation, we show that generalization beyond training depth can be unlocked by scaling inference-time recurrence, with more iterations enabling deeper reasoning. We further study how training strategies affect extrapolation, providing guidance on training recurrent-depth transformers, and identify a key limitation, overthinking, where excessive recurrence degrades predictions and limits generalization to very deep compositions.
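The abstract's depth-extrapolation finding, that more inference-time iterations enable deeper reasoning but excessive recurrence ("overthinking") degrades predictions, suggests a natural mitigation: iterate until the hidden state stops changing, rather than for a fixed large budget. The sketch below is a hypothetical illustration of that idea, not the authors' method; it uses a contractive toy update in place of a real transformer block, and the tolerance-based stopping rule is an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
W = rng.normal(scale=0.05, size=(d, d))

def step(h):
    # Contractive toy update standing in for one pass through the
    # shared transformer block; it converges toward a fixed point.
    return h + 0.5 * (np.tanh(h @ W) - h)

def run_with_budget(x, max_iters, tol=1e-4):
    # Scale inference-time recurrence up to max_iters, but stop early
    # once the state has converged: iterating far past this point is
    # the "overthinking" regime, where extra recurrence can only hurt.
    h = x
    for t in range(1, max_iters + 1):
        h_next = step(h)
        if np.linalg.norm(h_next - h) < tol:
            return h_next, t  # converged: stop before overthinking
        h = h_next
    return h, max_iters

x = rng.normal(size=(1, d))
h_final, t_used = run_with_budget(x, max_iters=50)
```

In this toy, deeper inputs would simply need more steps before the stopping criterion fires, mirroring the paper's observation that harder (deeper-hop) compositions require larger recurrence budgets.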