Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning

arXiv cs.LG / 4/24/2026


Key Points

  • The paper studies why test-time reinforcement learning (TTRL) for math reasoning can be vulnerable to spurious reward signals caused by pseudo-label noise during inference-time adaptation.
  • It empirically identifies an “ambiguity region” of responses with medium consistency, showing that these cases are the primary source of reward noise, which group-relative advantage estimation can further amplify.
  • To address this, the authors propose DDRL (Debiased and Denoised test-time Reinforcement Learning), which removes ambiguous samples via frequency-based sampling while keeping a balanced positive/negative set.
  • DDRL then applies debiased advantage estimation using fixed advantages and adds a consensus-based off-policy refinement step with rejection-sampled data for more stable updates.
  • Experiments across three large language models on multiple math reasoning benchmarks show DDRL consistently improves over existing TTRL baselines, and the authors plan to release the code soon.
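The denoising step described above can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the function name `filter_ambiguous`, the majority-vote pseudo-label, and the consistency thresholds `low`/`high` are all assumptions for the sake of the example.

```python
from collections import Counter

def filter_ambiguous(responses, answers, low=0.2, high=0.6):
    """Illustrative sketch: drop medium-consistency ("ambiguity region")
    responses and keep a balanced positive/negative set.

    `answers[i]` is the final answer extracted from `responses[i]`.
    The thresholds `low`/`high` are placeholder values, not the paper's.
    """
    counts = Counter(answers)
    n = len(answers)
    pseudo_label, _ = counts.most_common(1)[0]  # majority vote as pseudo-label

    kept_pos, kept_neg = [], []
    for resp, ans in zip(responses, answers):
        freq = counts[ans] / n          # consistency of this answer in the group
        if low < freq < high:           # ambiguity region: likely reward noise
            continue                    # exclude the sample entirely
        (kept_pos if ans == pseudo_label else kept_neg).append(resp)

    # balance positives and negatives by truncating to the smaller set
    k = min(len(kept_pos), len(kept_neg))
    return kept_pos[:k], kept_neg[:k], pseudo_label
```

For example, with eight sampled responses where five answers agree, the two answers appearing at medium frequency are discarded, and the remaining positives are truncated to match the lone negative.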

Abstract

Test-time reinforcement learning (TTRL) adapts models at inference time via pseudo-labeling, which leaves it vulnerable to spurious optimization signals from label noise. Through an empirical study, we observe that responses with medium consistency form an ambiguity region and constitute the primary source of reward noise. Crucially, we find that such spurious signals can even be amplified through group-relative advantage estimation. Motivated by these findings, we propose a unified framework, Debiased and Denoised test-time Reinforcement Learning (DDRL), to mitigate spurious signals. Concretely, DDRL first applies a frequency-based sampling strategy to exclude ambiguous samples while maintaining a balanced set of positive and negative examples. It then adopts a debiased advantage estimation with fixed advantages, removing the bias introduced by group-relative policy optimization. Finally, DDRL incorporates a consensus-based off-policy refinement stage, which leverages the rejection-sampled dataset to enable efficient and stable model updates. Experiments on three large language models across multiple mathematical reasoning benchmarks demonstrate that DDRL consistently outperforms existing TTRL baselines. The code will soon be released at https://github.com/yuyongcan/DDRL.
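The amplification effect the abstract describes can be made concrete with a minimal numerical sketch. Both functions below are illustrative assumptions: `group_relative_advantages` mirrors the standard GRPO-style normalization (reward minus group mean, divided by group standard deviation), while `fixed_advantages` stands in for the paper's fixed-advantage idea with placeholder constants of ±1.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage: (r - mean) / std over the sampled group."""
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards) or 1.0  # guard against a zero std
    return [(r - mu) / sd for r in rewards]

def fixed_advantages(rewards, pos=1.0, neg=-1.0):
    """Debiased sketch: assign a fixed advantage by reward sign only
    (illustrative constants; the paper's values may differ)."""
    return [pos if r > 0 else neg for r in rewards]

# A nearly unanimous group containing one possibly mislabeled response:
# the outlier's group-relative advantage is inflated by the small group
# std, while the fixed scheme bounds every magnitude at 1.
rewards = [1, 1, 1, 1, 1, 1, 1, 0]
print(group_relative_advantages(rewards))  # outlier magnitude ≈ 2.65
print(fixed_advantages(rewards))           # all magnitudes exactly 1
```

If the lone zero-reward response is in fact reward noise from a wrong pseudo-label, the normalized scheme pushes the update hardest on exactly that sample, which is the spurious-signal amplification the paper targets.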