Future Policy Approximation for Offline Reinforcement Learning Improves Mathematical Reasoning
arXiv cs.CL / 4/6/2026
Key Points
- The paper argues that offline reinforcement learning for LLM mathematical reasoning is currently less effective than online RL due to “gradient entanglement” in long-horizon trajectories where correct and incorrect solutions overlap in tokens.
- It introduces Future Policy Approximation (FPA), which reweights offline RL gradients using an estimate of the future policy (computed via logit-space extrapolation) rather than the current policy, adding negligible overhead (sketched after this list).
- The authors provide theoretical motivation by linking FPA to Optimistic Mirror Descent (whose extrapolated-gradient update is recalled below) and relate it to DPO, positioning the method within existing RLHF-style training frameworks.
- Experiments across three models and seven mathematical benchmarks show consistent gains over several strong offline baselines (including DPO, RPO, KTO, and vanilla offline RL).
- FPA is reported to stabilize long-horizon offline training where simpler objectives degrade, achieving accuracy comparable to online RLVR while using substantially less GPU compute.
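
The key points only name the mechanism, so the snippet below is a minimal sketch of what logit-space extrapolation toward a future policy could look like, not the paper's actual implementation. The function names, the use of the two most recent policy checkpoints, and the `alpha` extrapolation coefficient are illustrative assumptions.

```python
import torch


def fpa_token_weights(curr_logits, prev_logits, token_ids, alpha=1.0):
    """Hypothetical sketch: estimate a 'future policy' by extrapolating logits
    one step ahead, then weight each observed token by that future policy's
    probability instead of the current policy's.

    curr_logits, prev_logits: [batch, seq, vocab] logits from the current and
        previous policy checkpoints (assumption: extrapolation uses the last two).
    token_ids: [batch, seq] tokens of the offline trajectory.
    alpha: extrapolation strength (hypothetical hyperparameter).
    """
    # Logit-space extrapolation: logits_future = logits_t + alpha * (logits_t - logits_{t-1}).
    future_logits = curr_logits + alpha * (curr_logits - prev_logits)
    future_logprobs = torch.log_softmax(future_logits, dim=-1)
    # Probability the estimated future policy assigns to each observed token.
    weights = future_logprobs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1).exp()
    return weights.detach()  # treat the estimate as a fixed weight, no gradient


def fpa_weighted_loss(curr_logits, prev_logits, token_ids, advantages):
    """Vanilla offline policy-gradient loss reweighted by the FPA weights."""
    logprobs = torch.log_softmax(curr_logits, dim=-1)
    token_logprobs = logprobs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    weights = fpa_token_weights(curr_logits.detach(), prev_logits, token_ids)
    # advantages: [batch, seq] per-token (or broadcast per-sequence) rewards/returns.
    return -(weights * advantages * token_logprobs).mean()
```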
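For context on the Optimistic Mirror Descent connection, the standard optimistic gradient step with the usual "last gradient" predictor uses an extrapolated gradient rather than the current one; how FPA maps onto this update is developed in the paper itself, not here.

```latex
% Optimistic gradient step with predictor m_{t+1} = g_t: the update applies the
% extrapolated gradient g_t + (g_t - g_{t-1}) = 2 g_t - g_{t-1}, "looking ahead"
% one step -- the same intuition as weighting by an estimated future policy.
\[
  \theta_{t+1} \;=\; \theta_t - \eta\,\bigl(g_t + (g_t - g_{t-1})\bigr)
              \;=\; \theta_t - \eta\,(2 g_t - g_{t-1}).
\]
```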