StaRPO: Stability-Augmented Reinforcement Policy Optimization

arXiv cs.AI / 4/13/2026


Key Points

  • StaRPO is proposed as a reinforcement learning framework to improve large language model reasoning by optimizing not just final-answer correctness but also the stability of the reasoning process.
  • The method introduces two lightweight, computable stability metrics—Autocorrelation Function (ACF) for local step-to-step coherence and Path Efficiency (PE) for global goal-directedness along the reasoning trajectory.
  • StaRPO combines these stability rewards with standard task rewards to provide complementary, process-aware feedback during policy optimization.
  • Experiments report that ACF and PE correlate with logic errors on two backbone models and that StaRPO improves performance on four reasoning benchmarks, boosting both final-answer accuracy and logical stability.
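The paper does not spell out how ACF and PE are computed, but the descriptions above suggest a plausible reading: treat the reasoning trajectory as a sequence of step embeddings, score local coherence as lag-1 similarity between consecutive steps, and score global goal-directedness as how direct the path is from the first step to the last. The sketch below is an illustration under those assumptions only — the function names, the cosine-similarity choice, and the reward weights `w_acf`/`w_pe` are hypothetical, not taken from the paper.

```python
import math

def _cosine(u, v):
    # cosine similarity between two embedding vectors; 0.0 if either is zero
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def _dist(u, v):
    # Euclidean distance between two embedding vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def acf_reward(steps):
    """Local coherence proxy: mean lag-1 cosine similarity of consecutive
    reasoning-step embeddings (one interpretation of the paper's ACF)."""
    if len(steps) < 2:
        return 0.0
    sims = [_cosine(steps[i], steps[i + 1]) for i in range(len(steps) - 1)]
    return sum(sims) / len(sims)

def path_efficiency(steps):
    """Global goal-directedness proxy: straight-line distance from the first
    to the last step, divided by total path length. 1.0 means the trajectory
    never backtracks (one interpretation of the paper's PE)."""
    if len(steps) < 2:
        return 1.0
    total = sum(_dist(steps[i], steps[i + 1]) for i in range(len(steps) - 1))
    if total == 0.0:
        return 1.0  # trajectory never moved; trivially direct
    return _dist(steps[0], steps[-1]) / total

def starpo_reward(task_reward, steps, w_acf=0.1, w_pe=0.1):
    # Combined process-aware reward: task reward plus weighted stability
    # terms. The weights here are placeholders, not values from the paper.
    return task_reward + w_acf * acf_reward(steps) + w_pe * path_efficiency(steps)
```

For intuition: a trajectory that marches straight toward the answer, e.g. `[(0,0), (1,0), (2,0)]`, gets `path_efficiency` of 1.0, while one that doubles back, e.g. `[(0,0), (1,0), (0,0)]`, gets 0.0 — the redundant detour is penalized even if the final answer is correct.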

Abstract

Reinforcement learning (RL) is effective in enhancing the accuracy of large language models on complex reasoning tasks. However, existing RL policy optimization frameworks rely on final-answer correctness as the feedback signal and rarely capture the internal logical structure of the reasoning process. Consequently, models may generate responses that are fluent and semantically relevant but logically inconsistent, structurally erratic, or redundant. To address this, we propose StaRPO, a stability-augmented reinforcement learning framework that explicitly incorporates reasoning stability into the optimization objective. StaRPO decomposes stability into two lightweight, computable metrics: the Autocorrelation Function (ACF), which evaluates local step-to-step coherence, and Path Efficiency (PE), which evaluates the global goal-directedness of the reasoning trajectory. These stability rewards are combined with task rewards to provide complementary, process-aware feedback. We validate the effectiveness of the ACF and PE rewards by showing their correlation with logic errors on two backbone models. Experiments on four reasoning benchmarks show that StaRPO consistently outperforms the compared baselines and enhances both final-answer accuracy and logical stability.