Balancing the Reasoning Load: Difficulty-Differentiated Policy Optimization with Length Redistribution for Efficient and Robust Reinforcement Learning
arXiv cs.LG / 3/20/2026
Key Points
- The paper introduces Difficulty-Differentiated Policy Optimization (DDPO), a reinforcement learning algorithm that differentiates optimization between simple and complex tasks to address overthinking and overconfidence in Large Reasoning Models (LRMs).
- DDPO reduces output length on simpler tasks while expanding the exploration space on harder tasks to maintain or improve accuracy, balancing efficiency against performance; a hedged reward-shaping sketch follows this list.
- The authors derive theoretical conditions for maximizing expected accuracy: the output-length distribution should center near the optimal length and be as concentrated as possible, with the per-difficulty average length serving as the reference for length optimization; an illustrative expansion of this condition also follows the list.
- Empirical results on in-domain and out-of-domain benchmarks show DDPO reduces average answer length by 12% and increases accuracy by 1.85% compared with GRPO, indicating a better accuracy-length trade-off.
- The authors provide code for DDPO at https://github.com/Yinan-Xia/DDPO, enabling replication and practical use.
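The key points do not reproduce DDPO's implementation, so the following is a minimal sketch of the idea under stated assumptions: difficulty is estimated from a GRPO group's empirical pass rate, the per-group mean length stands in for the difficulty-level average, and every name here (`length_reward`, `alpha`, the 0.5 pass-rate threshold) is hypothetical rather than taken from the DDPO repository.

```python
import numpy as np

def length_reward(correct, lengths, alpha=0.5):
    """Hypothetical difficulty-differentiated length shaping for one GRPO group.

    correct: per-response indicators (1 = correct answer, 0 = incorrect)
    lengths: per-response token counts
    alpha:   weight of the length-shaping term relative to correctness
    """
    correct = np.asarray(correct, dtype=float)
    lengths = np.asarray(lengths, dtype=float)

    # Estimate task difficulty from the group's empirical pass rate:
    # a high pass rate marks a simple task, a low one a hard task.
    pass_rate = correct.mean()

    # Reference length: the group-average length, standing in for the
    # difficulty-level average described in the key points.
    ref_len = max(lengths.mean(), 1.0)
    overshoot = np.clip((lengths - ref_len) / ref_len, 0.0, None)

    if pass_rate >= 0.5:
        # Simple task: penalize correct answers that overshoot the
        # reference length, curbing overthinking.
        shaping = -alpha * overshoot * correct
    else:
        # Hard task: leave length unpenalized so the policy can explore
        # longer reasoning chains.
        shaping = np.zeros_like(lengths)

    return correct + shaping
```

In a GRPO-style update these shaped rewards would then be group-normalized into advantages; note the shaping term only ever subtracts from correct responses, so brevity is rewarded without making wrong-but-short answers attractive.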
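The theoretical condition in the third key point can be made concrete with a standard second-order expansion; this derivation is illustrative and assumes accuracy $A(\ell)$ is a smooth function of output length $\ell$ with maximizer $\ell^*$, and it does not claim to reproduce the paper's proof:

$$\mathbb{E}_{\ell \sim p}\,[A(\ell)] \;\approx\; A(\ell^*) + \tfrac{1}{2} A''(\ell^*)\left[(\mu - \ell^*)^2 + \sigma^2\right],$$

where $\mu$ and $\sigma^2$ are the mean and variance of the length distribution $p$, and $A'(\ell^*) = 0$ with $A''(\ell^*) \le 0$ at the maximizer. Because the bracketed term carries a non-positive coefficient, expected accuracy is maximized by driving $\mu$ toward $\ell^*$ and $\sigma^2$ toward zero, which is exactly the "close to the optimal length and as concentrated as possible" condition stated above.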