Future Policy Approximation for Offline Reinforcement Learning Improves Mathematical Reasoning

arXiv cs.CL / 4/6/2026


Key Points

  • The paper argues that offline reinforcement learning for LLM mathematical reasoning is currently less effective than online RL because of “gradient entanglement”: in long-horizon trajectories, correct and incorrect solutions share many of the same tokens, so penalizing incorrect trajectories also suppresses tokens that correct solutions depend on.
  • It introduces Future Policy Approximation (FPA), which reweights offline RL gradients using an estimate of the future policy (computed via logit-space extrapolation) rather than the current policy, adding negligible overhead.
  • The authors provide theoretical motivation by linking FPA to Optimistic Mirror Descent and explain a relationship to DPO, positioning the method within existing RLHF-style training frameworks.
  • Experiments across three models and seven mathematical benchmarks show consistent gains over several strong offline baselines (including DPO, RPO, KTO, and vanilla offline RL).
  • FPA is reported to stabilize long-horizon offline training where simpler objectives degrade, achieving accuracy comparable to online RLVR (RL with verifiable rewards) while using substantially less GPU compute.

Abstract

Reinforcement Learning (RL) has emerged as the key driver for post-training complex reasoning in Large Language Models (LLMs), yet online RL introduces significant instability and computational overhead. Offline RL offers a compelling alternative by decoupling inference from training; however, offline algorithms for reasoning remain under-optimized compared to their online counterparts. A central challenge is gradient entanglement: in long-horizon reasoning trajectories, correct and incorrect solutions share substantial token overlap, causing gradient updates from incorrect trajectories to suppress tokens critical for correct ones. We propose Future Policy Approximation (FPA), a simple method that weights gradients against an estimate of the future policy rather than the current one, enabling proactive gradient reweighting. This future policy is estimated via logit-space extrapolation with negligible overhead. We provide theoretical intuition for FPA through the lens of Optimistic Mirror Descent and further ground it through its connection to DPO. Evaluating FPA across three models and seven mathematical benchmarks, we demonstrate consistent improvements over strong offline baselines including DPO, RPO, KTO, and vanilla offline RL. FPA stabilizes long-horizon training where vanilla objectives degrade and achieves comparable accuracy to online RLVR at a fraction of its GPU hours.
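As a rough illustration of the mechanism described above, the sketch below estimates a future policy by extrapolating in logit space and uses it to reweight a per-token offline loss. This is a hedged reconstruction, not the paper's algorithm: the linear extrapolation form, the `alpha` step size, and the exact weighting scheme are all assumptions for illustration; the abstract only states that the future policy is estimated via logit-space extrapolation and used to weight gradients.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def future_policy_logprobs(logits_curr, logits_prev, alpha=1.0):
    """Anticipate the post-update policy by stepping forward along the
    most recent change in logits (assumed linear extrapolation form).

    alpha=0 recovers the current policy; larger alpha looks further
    "into the future" of the optimization trajectory.
    """
    logits_future = logits_curr + alpha * (logits_curr - logits_prev)
    return log_softmax(logits_future)

def fpa_weighted_loss(logits_curr, logits_prev, tokens, rewards, alpha=1.0):
    """Reward-weighted log-likelihood over a sampled trajectory, with each
    token's contribution scaled by the probability the *estimated future*
    policy assigns to it, rather than the current policy (the core FPA
    idea); this particular weighting is illustrative, not the paper's.
    """
    t = np.arange(len(tokens))
    logp_curr = log_softmax(logits_curr)[t, tokens]
    # Per-token weights from the extrapolated (future) policy.
    w = np.exp(future_policy_logprobs(logits_curr, logits_prev, alpha)[t, tokens])
    return -np.mean(rewards * w * logp_curr)
```

In this reading, tokens that the anticipated future policy is already moving away from receive smaller weights, so negative updates from incorrect trajectories spend less of their gradient on tokens that correct solutions also rely on, which is one plausible way the "proactive gradient reweighting" in the abstract could mitigate gradient entanglement.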