How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

arXiv cs.LG / April 29, 2026


Key Points

  • The paper argues that post-training with only output-level supervision via reinforcement learning from verifiable rewards (RLVR) stalls for reasoning models when the initial success probability p_0 is small; RLVR sits at the "exploitation pole" of the proposed loss family.
  • It introduces a Tsallis-based loss family J_q that interpolates between RLVR (q=0) and the log marginal likelihood over latent trajectories (q=1); the key effect is a scalar amplification factor P_θ^{-q} that reweights each instance (a minimal sketch follows this list).
  • Under a gradient-flow analysis, the exploitation pole needs Ω(1/p_0) time to escape cold start, while the density-estimation pole escapes in Θ(log(1/p_0)); intermediate q values trade faster escape against noise memorization.
  • Because P_θ is intractable, the authors propose two Monte Carlo training methods: Gradient-Amplified RL (GARL), which has lower variance, and Posterior-Attenuated Fine-Tuning (PAFT), whose gradients are semantically coherent; both share the same O(q/(M·P_θ^{q+1})) bias.
  • Experiments on FinQA, HotPotQA, and MuSiQue show that GARL at q=0.75 substantially mitigates cold-start stalling, escaping where GRPO fails entirely; in warm-start settings GARL can destabilize, and PAFT at q=0.75 yields the best overall results (notably 47.9 maj@16 on HotPotQA, +14.4 over GRPO).
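To make the amplification mechanism concrete: the Tsallis q-logarithm is ln_q(p) = (p^{1-q} − 1)/(1 − q), and the loss −ln_q(P_θ) has gradient −P_θ^{-q}·∇P_θ. The PyTorch sketch below is a hypothetical illustration (the function name tsallis_log and the scalar-probability setup are ours, not the paper's); it checks that only the per-instance scale changes with q:

```python
import torch

def tsallis_log(p: torch.Tensor, q: float) -> torch.Tensor:
    """Tsallis q-logarithm: ln_q(p) = (p^(1-q) - 1)/(1 - q); recovers log(p) as q -> 1."""
    if abs(q - 1.0) < 1e-8:
        return torch.log(p)
    return (p.pow(1.0 - q) - 1.0) / (1.0 - q)

# The loss J_q = -ln_q(P_theta) has dJ_q/dP = -P^{-q}: the gradient direction is
# shared across all q, and only the per-instance amplification P^{-q} changes.
p = torch.tensor(0.01, requires_grad=True)  # a small success probability (cold start)
for q in (0.0, 0.5, 1.0):
    (grad,) = torch.autograd.grad(-tsallis_log(p, q), p)
    print(f"q={q}: dJ/dP = {grad.item():+.1f}  (expected {-p.item() ** -q:+.1f})")
```

At q=0 a cold instance (P_θ = 0.01) gets weight 1 (the RLVR pole), while at q=1 it is amplified 100×, which is what lets the density-estimation pole escape cold start quickly.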

Abstract

Adapting reasoning models to new tasks during post-training with only output-level supervision stalls under reinforcement learning from verifiable rewards (RLVR) when the initial success probability p_0 is small. Using the Tsallis q-logarithm, we define a loss family J_q that interpolates between RLVR (at q=0, the exploitation pole) and the log marginal likelihood over latent trajectories (at q=1, the density-estimation pole). All members share the same per-example gradient direction, differing only by a scalar amplification P_θ^{-q} that reweights each instance independently of the learning rate. This amplification is the mechanism that addresses cold-start stalling: under gradient flow, the exploitation pole requires Ω(1/p_0) time to escape cold start, while the density-estimation pole escapes in Θ(log(1/p_0)); intermediate q trades escape speed against noise memorization. Because P_θ is intractable, we derive two Monte Carlo estimators from the two factorizations of the gradient: Gradient-Amplified RL (GARL) samples from the prior and amplifies the RL gradient, and Posterior-Attenuated Fine-Tuning (PAFT) importance-resamples from the posterior and runs standard SFT. Both have bias O(q/(M·P_θ^{q+1})); GARL has lower variance, PAFT has semantically coherent gradients. On FinQA, HotPotQA, and MuSiQue, GARL at q=0.75 substantially mitigates cold-start stalling, escaping cold start where GRPO fails entirely. In warm start, GARL at low q dominates on FinQA, where training is stable; on HotPotQA and MuSiQue, GARL destabilizes during training, and PAFT at q=0.75 provides stable gradients (best overall on HotPotQA at 47.9 maj@16, +14.4 over GRPO).
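To see where the Ω(1/p_0) versus Θ(log(1/p_0)) separation comes from, it helps to integrate a toy gradient-flow ODE. Assume, purely for illustration (the paper's actual analysis is not reproduced here), that the amplified gradient P^{-q}∇P drives success-probability dynamics dp/dt = p^{2-q}, i.e. ‖∇P‖² scales like P². Forward-Euler integration then reproduces both rates:

```python
def escape_time(p0: float, q: float, p_target: float = 0.5, dt: float = 1e-3) -> float:
    """Integrate the toy ODE dp/dt = p^(2-q) by forward Euler until p crosses p_target."""
    p, t = p0, 0.0
    while p < p_target:
        p += dt * p ** (2.0 - q)
        t += dt
    return t

for p0 in (1e-1, 1e-2, 1e-3):
    t_rl = escape_time(p0, q=0.0)   # exploitation pole: time grows like 1/p0
    t_mle = escape_time(p0, q=1.0)  # density-estimation pole: grows like log(1/p0)
    print(f"p0={p0:.0e}   q=0: t ~ {t_rl:7.1f}   q=1: t ~ {t_mle:4.2f}")
```

Shrinking p_0 tenfold roughly multiplies the q=0 escape time by ten but only adds a constant to the q=1 time; intermediate q interpolates between the two regimes.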
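The two estimators can likewise be sketched on a toy discrete policy. The code below is our reconstruction from the abstract's one-line descriptions (GARL: sample from the prior, amplify the RL gradient; PAFT: importance-resample the posterior, run standard SFT), not the authors' implementation; in particular, the P̂^{1−q} attenuation in PAFT is inferred from the requirement that both estimators target P_θ^{-q}·∇P_θ, and is an assumption rather than a quoted formula:

```python
import torch

torch.manual_seed(0)

# Toy instance: a policy over K discrete "trajectories", two of which verify as correct.
K, M, q = 8, 64, 0.75
theta = torch.zeros(K, requires_grad=True)
reward = torch.tensor([1., 0., 0., 1., 0., 0., 0., 0.])  # verifiable reward r(z)

probs = torch.softmax(theta, dim=0)
logp = torch.log_softmax(theta, dim=0)
z = torch.multinomial(probs.detach(), M, replacement=True)  # rollouts z_i ~ pi_theta (prior)
r = reward[z]
p_hat = r.mean().clamp_min(1.0 / M)  # Monte Carlo estimate of P_theta (a constant weight)

# GARL: the ordinary REINFORCE term E[r * log pi] (whose gradient estimates grad P_theta),
# amplified per instance by p_hat^{-q}.
garl_loss = -(p_hat.detach() ** -q) * (r * logp[z]).mean()

# PAFT: keep only successful rollouts (an importance-resampled draw from the posterior
# over correct trajectories), run standard SFT on them, attenuated by p_hat^{1-q};
# since E_posterior[grad log pi] = (grad P) / P, both losses target P^{-q} grad P.
succ = z[r > 0]                      # this sketch assumes at least one success
idx = succ[torch.randint(len(succ), (M,))]
paft_loss = -(p_hat.detach() ** (1.0 - q)) * logp[idx].mean()
```

Backpropagating garl_loss reuses standard RL plumbing and, per the paper, has lower variance; paft_loss touches only verified trajectories, which is the "semantically coherent gradients" property, at the cost of the extra resampling step.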