LoopRPT: Reinforcement Pre-Training for Looped Language Models
arXiv cs.CL / 3/23/2026
Key Points
- LoopRPT reframes next-token prediction as a next-token reasoning task for looped language models (LoopLMs), enabling reinforcement signals to be applied directly to latent steps via an EMA teacher reference and noisy latent rollouts.
- The approach targets intermediate latent representations, compressing effective reasoning into fewer iterations and improving per-step representation quality.
- Experiments on the Ouro architecture across multiple model scales show LoopRPT achieves Pareto dominance in accuracy-computation trade-offs and delivers notable gains on hard tokens, highlighting improved early-stage reasoning.
- The work proposes reinforcement pre-training as a principled paradigm for learning efficient latent reasoning in looped language models.
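To make the two mechanisms named above concrete, here is a minimal, dependency-free sketch of an EMA teacher update and a noisy latent rollout through loop iterations. All names (`ema_update`, `noisy_rollout`, `step_fn`) and the scalar-list representation of latents are hypothetical simplifications, not the paper's actual implementation; the real method operates on tensors inside the Ouro architecture.

```python
import random

def ema_update(teacher, student, decay=0.99):
    """EMA teacher reference: the teacher's parameters slowly track the
    student's, providing a stable target for the reinforcement signal.
    Latents/parameters are modeled here as flat lists of floats."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

def noisy_rollout(latent, step_fn, n_steps, noise_scale=0.1, rng=None):
    """Noisy latent rollout: unroll the loop for n_steps, perturbing each
    intermediate latent with Gaussian noise to generate exploration
    trajectories that the reinforcement signal can score per step."""
    rng = rng or random.Random(0)
    states = [list(latent)]
    for _ in range(n_steps):
        latent = step_fn(latent)                                  # one loop iteration
        latent = [x + rng.gauss(0.0, noise_scale) for x in latent]  # inject exploration noise
        states.append(list(latent))
    return states  # all intermediate latents, one per loop step

# Toy usage: a loop step that halves each latent coordinate.
teacher = ema_update([1.0, 2.0], [0.0, 0.0], decay=0.9)
trajectory = noisy_rollout([1.0, 1.0], lambda z: [0.5 * x for x in z], n_steps=4)
```

Scoring each entry of `trajectory` against the EMA teacher's prediction is what lets the reward reach intermediate latent steps rather than only the final output, which is the property the summary credits with compressing reasoning into fewer iterations.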