LoopRPT: Reinforcement Pre-Training for Looped Language Models

arXiv cs.CL / March 23, 2026


Key Points

  • LoopRPT reframes next-token prediction as a next-token reasoning task for LoopLMs, enabling reinforcement signals to be applied directly to latent steps via an EMA teacher reference and noisy latent rollouts.
  • The approach targets intermediate latent representations, compressing effective reasoning into fewer iterations and improving per-step representation quality.
  • Experiments on the Ouro architecture across multiple model scales show LoopRPT achieves Pareto dominance in accuracy-computation trade-offs and delivers notable gains on hard tokens, highlighting improved early-stage reasoning.
  • The work proposes reinforcement pre-training as a principled paradigm for learning efficient latent reasoning in looped language models.
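The EMA teacher mentioned above is a standard technique: a slowly-updated copy of the model's weights that serves as a stable reference. The sketch below shows the generic exponential-moving-average update; the function name, flat-list parameter layout, and decay value are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of an EMA teacher update. Each teacher weight moves a
# small step toward the corresponding student weight, so the teacher
# tracks the student smoothly and can serve as a stable reference.

def ema_update(teacher, student, decay=0.999):
    """Blend teacher weights toward student weights with factor `decay`."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

teacher = [1.0, 0.0]
student = [0.0, 1.0]
teacher = ema_update(teacher, student, decay=0.9)
# teacher ≈ [0.9, 0.1]: mostly the old teacher, nudged toward the student
```

With a high decay (e.g. 0.999), the teacher changes slowly across training steps, which is what makes it usable as a reference signal for the noisier student.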

Abstract

Looped language models (LoopLMs) perform iterative latent computation to refine internal representations, offering a promising alternative to explicit chain-of-thought (CoT) reasoning. However, existing reinforcement learning (RL) paradigms primarily target output tokens, creating a structural mismatch with looped architectures whose reasoning unfolds implicitly. In this work, we propose LoopRPT, a reinforcement pre-training framework tailored for LoopLMs. By reframing next-token prediction as a next-token reasoning task, LoopRPT assigns reinforcement signals directly to latent steps using an EMA teacher reference and noisy latent rollouts. This formulation enables RL to directly shape intermediate representations, compressing effective reasoning into fewer iterations. We instantiate LoopRPT on the Ouro architecture across multiple model scales. Results demonstrate that LoopRPT consistently improves per-step representation quality, achieving Pareto dominance in accuracy-computation trade-offs. Notably, significant gains on hard tokens indicate that LoopRPT enhances early-stage reasoning rather than merely encouraging premature exits. Our findings highlight reinforcement pre-training as a principled paradigm for learning efficient latent reasoning in LoopLMs.
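The looped computation the abstract describes can be pictured as one shared block applied repeatedly to a latent state, with a readout at each step; LoopRPT's contribution is scoring those intermediate steps. The toy sketch below uses a damped-averaging update as a stand-in for a shared transformer block; all names and the update rule are illustrative assumptions.

```python
# Hedged sketch of looped latent refinement with a per-step readout,
# the setting in which intermediate latent steps can be scored.

def loop_lm(latent, shared_step, readout, n_loops):
    """Apply one shared block repeatedly, reading out a prediction each step."""
    per_step = []
    for _ in range(n_loops):
        latent = shared_step(latent)        # same weights reused every iteration
        per_step.append(readout(latent))    # prediction available at every step
    return per_step

# Toy instantiation: each loop halves the distance to a fixed point at 1.0.
steps = loop_lm(0.0, lambda h: h + 0.5 * (1.0 - h), lambda h: h, 4)
# steps == [0.5, 0.75, 0.875, 0.9375]: later loops refine the representation
```

Under this picture, "compressing effective reasoning into fewer iterations" means pushing the quality seen at later steps into earlier ones, so the model can exit after fewer loops without losing accuracy.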