Limits of Difficulty Scaling: Hard Samples Yield Diminishing Returns in GRPO-Tuned SLMs

arXiv cs.LG / 4/9/2026


Key Points

  • The paper tests whether preference optimization (GRPO with LoRA) improves math reasoning accuracy in small language models (up to 3B) as problem difficulty increases.
  • Results show accuracy plateaus for harder tiers, suggesting GRPO mostly reshapes output preferences rather than reliably expanding capability to solve the most difficult samples.
  • Training with GRPO on only lower-difficulty problems can match full-dataset accuracy across difficulty tiers while using about 45% of the training steps, indicating diminishing returns from including the hardest examples.
  • A cross-dataset effect is observed: a GSM8K-trained GRPO model performs better on MATH’s numeric subset than a MATH-trained GRPO model, with gains of roughly 5% at 1.5B and 3% at 3B.
  • The authors conclude that achievable improvements depend strongly on the base model’s initial reasoning competence and the target dataset’s difficulty distribution.
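To make the mechanism concrete, here is a minimal sketch of GRPO's group-relative advantage, the signal this kind of training optimizes. The function name and the toy 0/1 rewards are illustrative assumptions, not the paper's code; the formula (reward minus group mean, divided by group standard deviation) is the standard GRPO definition.

```python
def group_relative_advantages(rewards, eps=1e-6):
    """For one prompt's group of sampled completions, score each completion
    relative to the group: (reward - group mean) / (group std + eps)."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mu) / (std + eps) for r in rewards]

# Four sampled answers to one math problem; reward 1.0 if correct, else 0.0.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))

# On a problem the model never solves, every reward is 0, so every advantage
# is 0 and the update carries no learning signal for that prompt.
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
```

This zero-signal case is one intuition for why the hardest tiers could yield diminishing returns: when the base model rarely samples a correct solution, the group-relative signal vanishes, and GRPO can only reshape preferences among solutions the model can already produce.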

Abstract

Recent alignment work on Large Language Models (LLMs) suggests preference optimization can improve reasoning by shifting probability mass toward better solutions. We test this claim in a resource-constrained setting by applying GRPO with LoRA to SLMs (up to 3B) for math reasoning on the GSM8K and MATH datasets with difficulty-stratified analyses. As problem difficulty increases, accuracy plateaus, revealing a capacity boundary: GRPO primarily reshapes output preferences without reliably improving hardest-tier solving. Consistent with this, training GRPO only on lower-difficulty problems matches full-dataset accuracy across difficulty tiers while using only ~45% of the training steps, indicating diminishing returns from harder samples in this regime. We also find a cross-dataset generalization effect: GSM8K-trained GRPO achieves higher accuracy on the numeric subset of MATH than MATH-trained GRPO, exceeding it by ~5% at 1.5B and by ~3% at 3B. We show that the best achievable gains depend strongly on the base model's prior reasoning competence and the dataset's difficulty profile.
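The lower-difficulty-only training regime described above amounts to filtering the training set by a difficulty tier before running GRPO. The sketch below assumes examples carry a numeric `level` field (MATH annotates problems with levels 1-5); the field name and cutoff are illustrative, not the paper's exact split.

```python
def lower_tier_subset(dataset, max_level=3):
    """Keep only problems at or below max_level, so GRPO trains only on
    lower-difficulty tiers (the regime the paper reports as matching
    full-dataset accuracy with fewer training steps)."""
    return [ex for ex in dataset if ex["level"] <= max_level]

# Toy dataset: six problems with MATH-style difficulty levels 1-5.
toy = [{"problem": f"p{i}", "level": lvl}
       for i, lvl in enumerate([1, 2, 3, 4, 5, 2])]
print(lower_tier_subset(toy))  # drops the level-4 and level-5 problems
```

Because the filtered subset is smaller, a fixed number of epochs over it naturally consumes fewer optimizer steps than the full dataset, which is how the ~45%-of-steps comparison arises.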