Limits of Difficulty Scaling: Hard Samples Yield Diminishing Returns in GRPO-Tuned SLMs

arXiv cs.LG / 4/9/2026


Key Points

  • The paper tests whether preference optimization (GRPO with LoRA) improves math reasoning accuracy in small language models (up to 3B) as problem difficulty increases.
  • Results show accuracy plateaus for harder tiers, suggesting GRPO mostly reshapes output preferences rather than reliably expanding capability to solve the most difficult samples.
  • Training with GRPO on only lower-difficulty problems can match full-dataset accuracy across difficulty tiers while using about 45% of the training steps, indicating diminishing returns from including the hardest examples.
  • A cross-dataset effect is observed: a GSM8K-trained GRPO model performs better on MATH’s numeric subset than a MATH-trained GRPO model, with gains of roughly 5% at 1.5B and 3% at 3B.
  • The authors conclude that achievable improvements depend strongly on the base model’s initial reasoning competence and the target dataset’s difficulty distribution.
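To make the mechanism concrete, here is a minimal sketch of GRPO's group-relative advantage, the signal this kind of training optimizes. The function name and the toy 0/1 rewards are illustrative assumptions, not the paper's code; the formula (reward minus group mean, divided by group standard deviation) is the standard GRPO definition.

```python
def group_relative_advantages(rewards, eps=1e-6):
    """For one prompt's group of sampled completions, score each completion
    relative to the group: (reward - group mean) / (group std + eps)."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mu) / (std + eps) for r in rewards]

# Four sampled answers to one math problem; reward 1.0 if correct, else 0.0.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))

# On a problem the model never solves, every reward is 0, so every advantage
# is 0 and the update carries no learning signal for that prompt.
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
```

This zero-signal case is one intuition for why the hardest tiers could yield diminishing returns: when the base model rarely samples a correct solution, the group-relative signal vanishes, and GRPO can only reshape preferences among solutions the model can already produce.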

Abstract

Recent alignment work on Large Language Models (LLMs) suggests preference optimization can improve reasoning by shifting probability mass toward better solutions. We test this claim in a resource-constrained setting by applying GRPO with LoRA to SLMs (up to 3B) for math reasoning on the GSM8K and MATH datasets with difficulty-stratified analyses. As problem difficulty increases, accuracy plateaus, revealing a capacity boundary: GRPO primarily reshapes output preferences without reliably improving hardest-tier solving. Consistent with this, training GRPO only on lower-difficulty problems matches full-dataset accuracy across difficulty tiers while using only ~45% of the training steps, indicating diminishing returns from harder samples in this regime. We also find a cross-dataset generalization effect: GSM8K-trained GRPO achieves higher accuracy on the numeric subset of MATH than MATH-trained GRPO, exceeding it by ~5% at 1.5B and by ~3% at 3B. We show that the best achievable gains depend strongly on the base model's prior reasoning competence and the dataset's difficulty profile.
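The lower-difficulty-only training regime described above amounts to filtering the training set by a difficulty tier before running GRPO. The sketch below assumes examples carry a numeric `level` field (MATH annotates problems with levels 1-5); the field name and cutoff are illustrative, not the paper's exact split.

```python
def lower_tier_subset(dataset, max_level=3):
    """Keep only problems at or below max_level, so GRPO trains only on
    lower-difficulty tiers (the regime the paper reports as matching
    full-dataset accuracy with fewer training steps)."""
    return [ex for ex in dataset if ex["level"] <= max_level]

# Toy dataset: six problems with MATH-style difficulty levels 1-5.
toy = [{"problem": f"p{i}", "level": lvl}
       for i, lvl in enumerate([1, 2, 3, 4, 5, 2])]
print(lower_tier_subset(toy))  # drops the level-4 and level-5 problems
```

Because the filtered subset is smaller, a fixed number of epochs over it naturally consumes fewer optimizer steps than the full dataset, which is how the ~45%-of-steps comparison arises.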