Can LLMs Learn to Reason Robustly under Noisy Supervision?

arXiv cs.LG / 4/7/2026


Key Points

  • The paper studies how Reinforcement Learning with Verifiable Rewards (RLVR) reasoning models behave when training signals include noisy labels, focusing on expert-scarcity settings where noise is unavoidable.
  • It distinguishes “inactive” noisy labels (mainly reduce data efficiency) from “active” noisy labels (can be reinforced by the rollout process and skew the model toward incorrect reasoning distributions).
  • Experiments reveal an Early Correctness Coherence effect, where accuracy on both clean and noisy samples improves similarly in early training even though noisy samples fall behind later.
  • Motivated by this dynamic, the authors propose Online Label Refinement (OLR), which progressively corrects suspected noisy labels via majority-voted answers when rollout pass-rate trends and historical-consistency conditions are both satisfied (see the sketch after this list).
  • Across multiple math and general reasoning benchmarks under noise ratios of 0.1–0.9, OLR improves robustness, yielding average gains of about 3.6–3.9% in-distribution and 3.3–4.6% out-of-distribution.
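
The refinement rule can be made concrete with a short sketch. The Python snippet below illustrates the two gating conditions named above: a positive trend in the majority answer's rollout pass rate, and an unchanged majority answer across recent updates. The function names, the least-squares trend estimator, and the `window` size are illustrative assumptions, not the paper's exact thresholds or statistics.

```python
import numpy as np

def should_refine(pass_rates: list[float],
                  majority_history: list[str],
                  window: int = 4) -> bool:
    """Check the two OLR-style gating conditions for one sample:
    (1) the majority answer's rollout pass rate is trending upward, and
    (2) the majority-voted answer has been stable across recent updates.
    The trend estimator and window size are illustrative assumptions."""
    if len(pass_rates) < window:
        return False  # too little history to estimate a trend

    # Condition 1: positive least-squares slope of the pass-rate series.
    steps = np.arange(len(pass_rates))
    slope = np.polyfit(steps, pass_rates, deg=1)[0]
    if slope <= 0:
        return False

    # Condition 2: the same majority answer over the last `window` updates.
    return len(set(majority_history[-window:])) == 1


def refine_label(rollout_answers: list[str],
                 current_label: str,
                 pass_rates: list[float],
                 majority_history: list[str]) -> str:
    """Return the (possibly corrected) label for one training sample,
    updating its per-sample histories in place."""
    # Majority vote over the current policy's rollout answers
    # (ties are broken arbitrarily in this sketch).
    majority = max(set(rollout_answers), key=rollout_answers.count)
    pass_rate = rollout_answers.count(majority) / len(rollout_answers)

    pass_rates.append(pass_rate)
    majority_history.append(majority)

    # Progressively correct the label only when both conditions hold.
    if majority != current_label and should_refine(pass_rates, majority_history):
        return majority
    return current_label
```

In this reading, a training loop would call `refine_label` once per suspected-noisy sample per policy update, so corrections accumulate only as the improving policy produces increasingly consistent majority answers.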

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is effective at training reasoning models when abundant, perfectly labeled data is available, but its vulnerability to the noisy labels that are unavoidable under expert scarcity remains critically underexplored. In this work, we take the first step toward a systematic analysis of noisy-label mechanisms in RLVR. In contrast to supervised classification, most RLVR algorithms incorporate a rollout-based condition: a label's influence on training is contingent on whether the current policy can generate rollouts that realize it, a property that naturally extends to noisy labels. Based on this observation, we distinguish two types of noise: inactive noisy labels, which mainly reduce data efficiency, and active noisy labels, which are reinforced by training and risk skewing the model toward incorrect reasoning distributions. From experiments on training with noisy samples, we identify an Early Correctness Coherence phenomenon: although noisy samples begin to lag behind in later stages, accuracy on both clean and noisy samples increases similarly in early training. Motivated by this dynamic, we propose Online Label Refinement (OLR), which progressively corrects potentially noisy labels with majority-voted answers when two conditions hold: a positive slope in the majority answer's rollout pass rate, and stable historical consistency across updates. This enables gradual self-correction as the policy improves. We evaluate OLR on six in-distribution mathematical reasoning benchmarks (AIME24/25, AMC, MATH-500, Minerva, and Olympiad) and three out-of-distribution tasks (ARC-c, GPQA-diamond, and MMLU-pro). Across noise ratios from 0.1 to 0.9, OLR consistently improves robustness under both inactive and active noisy-label settings, achieving average gains of 3.6% to 3.9% on in-distribution benchmarks and 3.3% to 4.6% on out-of-distribution evaluations.
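
The rollout-based condition that separates inactive from active noise can also be expressed in a few lines. The sketch below is a simplified reading rather than the paper's implementation: with a binary verifiable reward, a noisy label that no rollout matches yields all-zero rewards and thus no positive signal to reinforce, while a noisy label the policy can realize earns reward and gets reinforced.

```python
def classify_noisy_label(rollout_answers: list[str], label: str) -> str:
    """Illustrative (not the paper's code): with a binary verifiable
    reward, a label affects the policy's update only if at least one
    rollout realizes it. A noisy label no rollout matches produces
    all-zero rewards and mostly wastes the sample ("inactive"); one
    the policy can realize gets reinforced and can skew the reasoning
    distribution toward the wrong answer ("active")."""
    rewards = [1.0 if answer == label else 0.0 for answer in rollout_answers]
    return "active" if sum(rewards) > 0 else "inactive"
```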
