AI Navigate

A Task-Centric Theory for Iterative Self-Improvement with Easy-to-Hard Curricula

arXiv stat.ML · March 23, 2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper analyzes iterative self-improvement fine-tuning of autoregressive LLMs using reward-verified outputs and derives finite-sample guarantees for the expected reward.
  • It models each round as maximum-likelihood fine-tuning on a reward-filtered distribution and reveals a feedback loop in which better models accept more data per iteration, enabling sustained improvement with eventual saturation.
  • Adopting a task-centric view with easy-to-hard curricula, the authors prove conditions on initialization, task difficulty, and budget under which curricula outperform training on fixed mixtures of tasks.
  • The theory is validated with Monte-Carlo simulations and experiments on a synthetic graph-based reasoning task and standard mathematical reasoning benchmarks.
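The single-round mechanics described in the key points can be sketched in a few lines. This is an illustrative toy, not the paper's method: the "model" is reduced to a success probability `p`, the reward is a binary verifier, and `N_SAMPLES`, the learning rate, and the update rule are all assumed for illustration.

```python
import random

N_SAMPLES = 100  # per-round sampling budget (hypothetical value)

class ToyModel:
    """Stand-in for an LLM: each generation attempt succeeds with probability p."""
    def __init__(self, p):
        self.p = p

    def sample(self):
        return 1 if random.random() < self.p else 0

    def finetune(self, accepted, lr=0.5):
        # Maximum-likelihood step on the reward-filtered data: every kept
        # sample is a verified success, so the update moves p toward 1 in
        # proportion to how much data survived the filter.
        self.p += lr * (len(accepted) / N_SAMPLES) * (1.0 - self.p)

def self_improve_round(model, reward):
    """One round: generate, keep reward-verified outputs, fine-tune on them."""
    outputs = [model.sample() for _ in range(N_SAMPLES)]
    accepted = [o for o in outputs if reward(o) == 1]  # the reward filter
    if accepted:
        model.finetune(accepted)
    return len(accepted)

random.seed(1)
model = ToyModel(p=0.3)
kept = self_improve_round(model, reward=lambda o: o)
```

Note that the amount of data the round produces (`kept`) depends on the model's current quality, which is exactly the coupling the paper's finite-sample analysis makes precise.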

Abstract

Iterative self-improvement fine-tunes an autoregressive large language model (LLM) on reward-verified outputs generated by the LLM itself. In contrast to the empirical success of self-improvement, the theoretical foundation of this generative, iterative procedure in a practical, finite-sample setting remains limited. We make progress toward this goal by modeling each round of self-improvement as maximum-likelihood fine-tuning on a reward-filtered distribution and deriving finite-sample guarantees for the expected reward. Our analysis reveals an explicit feedback loop where better models accept more data per iteration, supporting sustained self-improvement while explaining eventual saturation of such improvement. Adopting a task-centric view by considering reasoning tasks with multiple difficulty levels, we further prove quantifiable conditions on model initialization, task difficulty, and sample budget where easy-to-hard curricula provably achieve better guarantees than training on fixed mixtures of tasks. Our analyses are validated through Monte-Carlo simulations and experiments spanning a synthetic graph-based reasoning task and multiple standard mathematical reasoning benchmarks.
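The feedback loop and its saturation can be illustrated with a small Monte-Carlo sketch, again a toy rather than the paper's actual analysis: all parameters (`p0`, `n_samples`, `lr`, `rounds`) and the multiplicative update rule are assumptions chosen to exhibit the qualitative behavior the abstract describes.

```python
import random

def simulate(p0=0.2, n_samples=200, lr=0.5, rounds=12):
    """Toy dynamics of iterative self-improvement: better models pass more
    samples through the reward filter, and each 'fine-tuning' step moves p
    toward 1 in proportion to the accepted fraction."""
    p, history = p0, []
    for _ in range(rounds):
        accepted = sum(random.random() < p for _ in range(n_samples))
        frac = accepted / n_samples       # better models accept more data
        p += lr * frac * (1.0 - p)        # the (1 - p) factor drives saturation
        history.append(p)
    return history

random.seed(0)
hist = simulate()
early_gain = hist[1] - hist[0]
late_gain = hist[-1] - hist[-2]
```

Under these assumptions the reward improves monotonically, with large early gains that shrink as the success probability approaches 1, mirroring the sustained-improvement-then-saturation pattern the analysis explains.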