How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

arXiv cs.LG / April 29, 2026


Key Points

  • The paper argues that post-training with only output-level supervision via reinforcement learning from verifiable rewards (RLVR) stalls for reasoning models when the initial success probability p_0 is small; RLVR sits at the "exploitation pole" of the proposed loss family.
  • It introduces a Tsallis-based loss family J_q that interpolates between RLVR (q=0) and the log marginal likelihood over latent trajectories (q=1); the key effect is a scalar amplification factor P_θ^{-q} that reweights each instance (a minimal sketch follows this list).
  • Under a gradient-flow analysis, the exploitation pole needs Ω(1/p_0) time to escape cold start, while the density-estimation pole escapes in Θ(log(1/p_0)); intermediate q values trade faster escape against noise memorization.
  • Because P_θ is intractable, the authors propose two Monte Carlo training methods: Gradient-Amplified RL (GARL), which has lower variance, and Posterior-Attenuated Fine-Tuning (PAFT), whose gradients are semantically coherent; both share the same O(q/(M·P_θ^{q+1})) bias.
  • Experiments on FinQA, HotPotQA, and MuSiQue show that GARL at q=0.75 substantially mitigates cold-start stalling, escaping where GRPO fails entirely; in warm-start settings GARL can destabilize, and PAFT at q=0.75 yields the best overall results (notably 47.9 maj@16 on HotPotQA, +14.4 over GRPO).
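To make the amplification mechanism concrete: the Tsallis q-logarithm is ln_q(p) = (p^{1-q} − 1)/(1 − q), and the loss −ln_q(P_θ) has gradient −P_θ^{-q}·∇P_θ. The PyTorch sketch below is a hypothetical illustration (the function name tsallis_log and the scalar-probability setup are ours, not the paper's); it checks that only the per-instance scale changes with q:

```python
import torch

def tsallis_log(p: torch.Tensor, q: float) -> torch.Tensor:
    """Tsallis q-logarithm: ln_q(p) = (p^(1-q) - 1)/(1 - q); recovers log(p) as q -> 1."""
    if abs(q - 1.0) < 1e-8:
        return torch.log(p)
    return (p.pow(1.0 - q) - 1.0) / (1.0 - q)

# The loss J_q = -ln_q(P_theta) has dJ_q/dP = -P^{-q}: the gradient direction is
# shared across all q, and only the per-instance amplification P^{-q} changes.
p = torch.tensor(0.01, requires_grad=True)  # a small success probability (cold start)
for q in (0.0, 0.5, 1.0):
    (grad,) = torch.autograd.grad(-tsallis_log(p, q), p)
    print(f"q={q}: dJ/dP = {grad.item():+.1f}  (expected {-p.item() ** -q:+.1f})")
```

At q=0 a cold instance (P_θ = 0.01) gets weight 1 (the RLVR pole), while at q=1 it is amplified 100×, which is what lets the density-estimation pole escape cold start quickly.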

Abstract

Adapting reasoning models to new tasks during post-training with only output-level supervision stalls under reinforcement learning from verifiable rewards (RLVR) when the initial success probability p_0 is small. Using the Tsallis q-logarithm, we define a loss family J_q that interpolates between RLVR (at q=0, the exploitation pole) and the log marginal likelihood over latent trajectories (at q=1, the density-estimation pole). All members share the same per-example gradient direction, differing only by a scalar amplification P_θ^{-q} that reweights each instance independently of the learning rate. This amplification is the mechanism that addresses cold-start stalling: under gradient flow, the exploitation pole requires Ω(1/p_0) time to escape cold start, while the density-estimation pole escapes in Θ(log(1/p_0)); intermediate q trades escape speed against noise memorization. Because P_θ is intractable, we derive two Monte Carlo estimators from the two factorizations of the gradient: Gradient-Amplified RL (GARL) samples from the prior and amplifies the RL gradient, and Posterior-Attenuated Fine-Tuning (PAFT) importance-resamples from the posterior and runs standard SFT. Both have bias O(q/(M·P_θ^{q+1})); GARL has lower variance, PAFT has semantically coherent gradients. On FinQA, HotPotQA, and MuSiQue, GARL at q=0.75 substantially mitigates cold-start stalling, escaping cold start where GRPO fails entirely. In warm start, GARL at low q dominates on FinQA, where training is stable; on HotPotQA and MuSiQue, GARL destabilizes during training, and PAFT at q=0.75 provides stable gradients (best overall on HotPotQA at 47.9 maj@16, +14.4 over GRPO).
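To see where the Ω(1/p_0) versus Θ(log(1/p_0)) separation comes from, it helps to integrate a toy gradient-flow ODE. Assume, purely for illustration (the paper's actual analysis is not reproduced here), that the amplified gradient P^{-q}∇P drives success-probability dynamics dp/dt = p^{2-q}, i.e. ‖∇P‖² scales like P². Forward-Euler integration then reproduces both rates:

```python
def escape_time(p0: float, q: float, p_target: float = 0.5, dt: float = 1e-3) -> float:
    """Integrate the toy ODE dp/dt = p^(2-q) by forward Euler until p crosses p_target."""
    p, t = p0, 0.0
    while p < p_target:
        p += dt * p ** (2.0 - q)
        t += dt
    return t

for p0 in (1e-1, 1e-2, 1e-3):
    t_rl = escape_time(p0, q=0.0)   # exploitation pole: time grows like 1/p0
    t_mle = escape_time(p0, q=1.0)  # density-estimation pole: grows like log(1/p0)
    print(f"p0={p0:.0e}   q=0: t ~ {t_rl:7.1f}   q=1: t ~ {t_mle:4.2f}")
```

Shrinking p_0 tenfold roughly multiplies the q=0 escape time by ten but only adds a constant to the q=1 time; intermediate q interpolates between the two regimes.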
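The two estimators can likewise be sketched on a toy discrete policy. The code below is our reconstruction from the abstract's one-line descriptions (GARL: sample from the prior, amplify the RL gradient; PAFT: importance-resample the posterior, run standard SFT), not the authors' implementation; in particular, the P̂^{1−q} attenuation in PAFT is inferred from the requirement that both estimators target P_θ^{-q}·∇P_θ, and is an assumption rather than a quoted formula:

```python
import torch

torch.manual_seed(0)

# Toy instance: a policy over K discrete "trajectories", two of which verify as correct.
K, M, q = 8, 64, 0.75
theta = torch.zeros(K, requires_grad=True)
reward = torch.tensor([1., 0., 0., 1., 0., 0., 0., 0.])  # verifiable reward r(z)

probs = torch.softmax(theta, dim=0)
logp = torch.log_softmax(theta, dim=0)
z = torch.multinomial(probs.detach(), M, replacement=True)  # rollouts z_i ~ pi_theta (prior)
r = reward[z]
p_hat = r.mean().clamp_min(1.0 / M)  # Monte Carlo estimate of P_theta (a constant weight)

# GARL: the ordinary REINFORCE term E[r * log pi] (whose gradient estimates grad P_theta),
# amplified per instance by p_hat^{-q}.
garl_loss = -(p_hat.detach() ** -q) * (r * logp[z]).mean()

# PAFT: keep only successful rollouts (an importance-resampled draw from the posterior
# over correct trajectories), run standard SFT on them, attenuated by p_hat^{1-q};
# since E_posterior[grad log pi] = (grad P) / P, both losses target P^{-q} grad P.
succ = z[r > 0]                      # this sketch assumes at least one success
idx = succ[torch.randint(len(succ), (M,))]
paft_loss = -(p_hat.detach() ** (1.0 - q)) * logp[idx].mean()
```

Backpropagating garl_loss reuses standard RL plumbing and, per the paper, has lower variance; paft_loss touches only verified trajectories, which is the "semantically coherent gradients" property, at the cost of the extra resampling step.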