Do Post-Training Algorithms Actually Differ? A Controlled Study Across Model Scales Uncovers Scale-Dependent Ranking Inversions

arXiv cs.AI / March 23, 2026


Key Points

  • The paper introduces OXRL, a unified framework implementing 51 post-training algorithms, used here for an apples-to-apples evaluation of 8 algorithms across 4 model scales (0.5B–7B) and multiple evaluation domains.
  • It reveals that algorithm rankings are scale-dependent, with online RL (SGRPO) leading at 1.5B but SimPO becoming the top method at 7B, indicating scale-driven ranking inversions.
  • Modifying loss functions yields negligible gains across 20 DPO variants; the only significant outlier is SimPO, which performs worse.
  • Algorithm leverage is highly task-specific, with large performance gaps on GSM8K collapsing on MATH and general-domain benchmarks, suggesting most impact occurs within the training distribution.
  • The authors release all code, configurations, and evaluation data as a living community benchmark for ongoing, apples-to-apples comparisons.
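The ranking-inversion finding rests on disentangling model scale from the fine-tuning adapter. A 2×2 factorial design of the kind the abstract mentions can be read off from its main effects; the sketch below is illustrative only, with invented placeholder accuracies (not the paper's numbers) and a hypothetical `main_effect` helper.

```python
# Hedged sketch of a 2x2 factorial readout (scale x adapter), as one might use
# to attribute a ranking inversion to model scale rather than LoRA regularization.
# All accuracy values below are invented placeholders, not the paper's results.

# cells[(scale, adapter)] = GSM8K accuracy (illustrative)
cells = {
    ("1.5B", "LoRA"): 40.0,
    ("1.5B", "full"): 41.0,
    ("7B",   "LoRA"): 84.0,
    ("7B",   "full"): 86.0,
}

def main_effect(cells, factor_index, level_a, level_b):
    """Average difference between two levels of one factor,
    marginalizing over the other factor."""
    a = [v for k, v in cells.items() if k[factor_index] == level_a]
    b = [v for k, v in cells.items() if k[factor_index] == level_b]
    return sum(a) / len(a) - sum(b) / len(b)

scale_effect = main_effect(cells, 0, "7B", "1.5B")      # 44.5 pp with these placeholders
adapter_effect = main_effect(cells, 1, "full", "LoRA")  # 1.5 pp with these placeholders
print(scale_effect, adapter_effect)
```

If the scale main effect dwarfs the adapter main effect, as in this toy readout, the inversion is attributable to scale rather than to the LoRA/full-fine-tuning choice.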

Abstract

Post-training alignment has produced dozens of competing algorithms -- DPO, SimPO, KTO, GRPO, and others -- yet practitioners lack controlled comparisons to guide algorithm selection. We present OXRL, a unified framework implementing 51 post-training algorithms with identical infrastructure, enabling the first large-scale apples-to-apples evaluation. Our study spans 8 algorithms across 4 model scales (0.5B–7B), 3 evaluation domains, and a 20-variant DPO taxonomy (100 runs at 1.5B, 5 seeds each), totaling ~240 training runs on H100 GPUs. Three headline findings emerge. (1) Algorithm rankings are unstable across scale: at 1.5B, online RL (SGRPO) tops all methods at 58.0% ± 0.57 on GSM8K; by 7B, the worst small-scale method (SimPO) becomes the best (85.8%), a complete ranking inversion driven by model scale rather than LoRA regularization (confirmed via a 2×2 factorial analysis). (2) Loss-function modifications yield negligible gains: none of 20 DPO variants significantly outperform vanilla DPO after Bonferroni correction; the sole significant outlier, SimPO, is worse (−11.5 pp, p < 10⁻⁴). (3) Algorithm leverage is task-specific: the 19.3 pp GSM8K spread collapses to 0.54 pp on MATH (36× smaller) and 0.47 pp on general-domain benchmarks (41× smaller), confirming that algorithm choice matters primarily within the training distribution. These findings yield a hierarchy of leverage for practitioners: model scale (~50 pp) ≫ training paradigm (~10 pp) ≫ online vs. offline (~9 pp) ≫ loss function (~1 pp). We release all code, configs, and evaluation data as a living community benchmark.
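The Bonferroni correction in finding (2) is simple but consequential: with 20 variants tested against vanilla DPO, each raw p-value must clear alpha divided by the number of tests. A minimal sketch, with invented p-values (not the paper's data) and a hypothetical `bonferroni_significant` helper:

```python
# Hedged sketch: Bonferroni correction over multiple DPO-variant comparisons.
# The variant names and p-values are illustrative, not the paper's actual data.

def bonferroni_significant(p_values, alpha=0.05):
    """Return which hypotheses survive a Bonferroni correction:
    each raw p-value must fall below alpha / number_of_tests."""
    threshold = alpha / len(p_values)
    return {name: p < threshold for name, p in p_values.items()}

# Illustrative raw p-values for a few variants vs. vanilla DPO
raw_p = {
    "IPO": 0.03,     # passes an uncorrected alpha=0.05, fails the corrected bar
    "cDPO": 0.40,
    "SimPO": 1e-5,   # mirrors the paper's sole significant outlier (worse, not better)
}

result = bonferroni_significant(raw_p, alpha=0.05)
print(result)  # only SimPO clears alpha/3 here; with 20 tests the bar drops to 0.0025
```

This is why "20 variants, none significant" is a stronger statement than it first appears: the more variants tested, the higher the bar each one must clear.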