FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

arXiv cs.LG / 3/23/2026


Key Points

  • Introduces Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm designed to mitigate reasoning bottlenecks in large language models by using discounted future-KL divergence in policy updates.
  • Replaces coarse-grained outcome-based rewards with a dense, token-level advantage that weights tokens by their influence on subsequent trajectory behavior, enabling more precise credit assignment.
  • Demonstrates empirical gains on the Qwen2.5-32B model, extending average chain-of-thought length from about 4,000 tokens to over 10,000 and boosting AIME 2024 Pass@1 from 50.0% to a peak of 58.0% (converging near 56.0%), outperforming several baselines.
  • Open-sources its training system built on the verl framework, highlighting practical reproducibility and a path for evolving ORM-based algorithms toward better reasoning capability.

Abstract

We present Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm designed to overcome reasoning bottlenecks in large language models. While GRPO-style training scales effectively, it typically relies on outcome-based rewards (ORM) that distribute a global advantage uniformly across every token in a trajectory. We argue that this coarse-grained credit assignment imposes a performance ceiling by failing to distinguish critical logical pivots from trivial tokens. FIPO addresses this by incorporating discounted future-KL divergence into the policy update, creating a dense advantage formulation that re-weights tokens based on their influence on subsequent trajectory behavior. Empirically, FIPO enables models to break through the length stagnation seen in standard baselines. Evaluated on Qwen2.5-32B, FIPO extends the average chain-of-thought length from roughly 4,000 to over 10,000 tokens and increases AIME 2024 Pass@1 accuracy from 50.0% to a peak of 58.0% (converging at approximately 56.0%). This outperforms both DeepSeek-R1-Zero-Math-32B (around 47.0%) and o1-mini (approximately 56.0%). Our results suggest that establishing dense advantage formulations is a vital path for evolving ORM-based algorithms to unlock the full reasoning potential of base models. We open-source our training system, built on the verl framework.
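To make the dense-advantage idea concrete, here is a minimal sketch of how a discounted future-KL weighting could turn a single trajectory-level outcome advantage into per-token advantages. The function name, the normalization, and the exact recursion are assumptions for illustration only; the paper's precise formulation is not given in this summary.

```python
import numpy as np

def fipo_token_advantages(outcome_advantage, token_kls, gamma=0.95):
    """Hypothetical sketch of a FIPO-style dense advantage.

    outcome_advantage: scalar trajectory-level advantage (as in GRPO/ORM).
    token_kls: per-token KL divergence between the current policy and a
               reference policy at each position (length T). Assumed input.
    gamma: discount factor over future tokens (illustrative value).
    """
    T = len(token_kls)
    future_kl = np.zeros(T)
    running = 0.0
    # Backward pass: future_kl[t] accumulates the discounted KL of the
    # tokens *after* position t, i.e. sum_{k>t} gamma^(k-t-1) * kl[k].
    for t in range(T - 1, -1, -1):
        future_kl[t] = running
        running = token_kls[t] + gamma * running
    # Normalize so the average weight is ~1, preserving the overall
    # advantage scale while re-weighting individual tokens.
    weights = future_kl / (future_kl.mean() + 1e-8)
    return outcome_advantage * weights
```

Under this sketch, a token followed by a burst of policy divergence (a "logical pivot") receives a larger share of the trajectory's advantage, while tokens whose futures look unchanged are down-weighted, rather than every token receiving the same uniform credit.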