Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Sequence-Level Likelihood

arXiv cs.CL / 4/15/2026


Key Points

  • The paper introduces TEPO (Token-Level Policy Optimization) to improve Group Relative Policy Optimization (GRPO) for LLMs under token-level sparse-reward settings common in chain-of-thought mathematical reasoning.
  • TEPO links group-level rewards to token-level learning by aggregating token updates using sequence-level likelihood, addressing how sparse token rewards are assigned during training.
  • It adds a token-level KL-divergence mask constraint that applies to tokens with positive advantages and decreasing entropy, aiming to prevent abrupt policy updates that can cause entropy collapse or degradation.
  • Experiments report state-of-the-art results on mathematical reasoning benchmarks and improved training stability, including a claimed 50% reduction in convergence time versus GRPO/DAPO.
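The paper itself does not publish reference code here, but the first mechanism in the key points (group-relative advantages distributed over tokens, with token terms aggregated under a sequence-level likelihood weight) can be sketched roughly as follows. All function names and the exact weighting scheme below are assumptions for illustration, not the authors' implementation:

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    # GRPO-style normalization: each sequence's reward is standardized
    # within its own sampled group of rollouts.
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards)) + eps
    return [(r - mean) / std for r in rewards]

def tepo_style_loss(group_token_logprobs, rewards):
    # Hypothetical TEPO-style objective: the group-relative advantage of
    # each sequence is spread over its tokens, and the per-token terms are
    # aggregated under a sequence-level likelihood weight (here, exp of the
    # mean token log-prob, i.e. a length-normalized sequence likelihood).
    advs = group_relative_advantages(rewards)
    loss = 0.0
    for logps, adv in zip(group_token_logprobs, advs):
        mean_logp = sum(logps) / len(logps)
        seq_lik = math.exp(mean_logp)  # sequence-level likelihood weight
        # every token contributes through the same sequence-level weight,
        # linking the group reward to token-level updates
        loss -= adv * seq_lik * mean_logp
    return loss
```

In an actual training loop the log-probs would be differentiable tensors and `loss` would be backpropagated; the point of the sketch is only how a single group reward reaches individual tokens through a sequence-level quantity.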

Abstract

Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly in their mathematical reasoning performance. However, GRPO and related entropy regularization methods still struggle with token-level sparse rewards, which is an inherent challenge in chain-of-thought (CoT) reasoning. These approaches often rely on undifferentiated token-level entropy regularization, which easily leads to entropy collapse or model degradation under sparse token rewards. In this work, we propose TEPO, a novel token-level framework that (1) leverages sequence-level likelihood to link group-level rewards with individual tokens via token-level aggregation, and (2) introduces a token-level KL-Divergence mask constraint that targets tokens with positive advantages and decreasing entropy to mitigate abrupt policy updates. Experiments demonstrate that TEPO not only achieves state-of-the-art performance on mathematical reasoning benchmarks but also markedly enhances training stability, reducing convergence time by 50% compared with GRPO/DAPO.
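The abstract's second component, the token-level KL-divergence mask, selects which tokens receive the KL constraint: only those with a positive advantage whose policy entropy is decreasing. A minimal sketch of that selection rule, with hypothetical names and a binary mask as a simplifying assumption, might look like:

```python
import math

def token_entropy(probs):
    # Shannon entropy of one token position's next-token distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def kl_mask(advantages, entropy_prev, entropy_curr):
    # Hypothetical TEPO-style mask: the KL penalty is switched on only at
    # tokens whose advantage is positive AND whose entropy has decreased
    # since the previous policy, damping abrupt updates at exactly the
    # positions the abstract identifies as fragile.
    return [
        1.0 if (a > 0 and e_c < e_p) else 0.0
        for a, e_p, e_c in zip(advantages, entropy_prev, entropy_curr)
    ]
```

The masked KL term would then multiply a per-token KL divergence between the current and reference policies before it enters the loss, so tokens outside the mask are updated without the extra constraint.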