Can LLMs Learn to Reason Robustly under Noisy Supervision?

arXiv cs.LG / 4/7/2026


Key Points

  • The paper studies how Reinforcement Learning with Verifiable Rewards (RLVR) reasoning models behave when training signals include noisy labels, focusing on expert-scarcity settings where noise is unavoidable.
  • It distinguishes “inactive” noisy labels (mainly reduce data efficiency) from “active” noisy labels (can be reinforced by the rollout process and skew the model toward incorrect reasoning distributions).
  • Experiments reveal an Early Correctness Coherence effect, where accuracy on both clean and noisy samples improves similarly in early training even though noisy samples fall behind later.
  • Motivated by this dynamic, the authors propose Online Label Refinement (OLR), which progressively corrects suspected noisy labels via majority-voted answers when rollout pass-rate trends and historical-consistency conditions are both satisfied (see the sketch after this list).
  • Across multiple math and general reasoning benchmarks under noise ratios of 0.1–0.9, OLR improves robustness, yielding average gains of about 3.6–3.9% in-distribution and 3.3–4.6% out-of-distribution.
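
The refinement rule can be made concrete with a short sketch. The Python snippet below illustrates the two gating conditions named above: a positive trend in the majority answer's rollout pass rate, and an unchanged majority answer across recent updates. The function names, the least-squares trend estimator, and the `window` size are illustrative assumptions, not the paper's exact thresholds or statistics.

```python
import numpy as np

def should_refine(pass_rates: list[float],
                  majority_history: list[str],
                  window: int = 4) -> bool:
    """Check the two OLR-style gating conditions for one sample:
    (1) the majority answer's rollout pass rate is trending upward, and
    (2) the majority-voted answer has been stable across recent updates.
    The trend estimator and window size are illustrative assumptions."""
    if len(pass_rates) < window:
        return False  # too little history to estimate a trend

    # Condition 1: positive least-squares slope of the pass-rate series.
    steps = np.arange(len(pass_rates))
    slope = np.polyfit(steps, pass_rates, deg=1)[0]
    if slope <= 0:
        return False

    # Condition 2: the same majority answer over the last `window` updates.
    return len(set(majority_history[-window:])) == 1


def refine_label(rollout_answers: list[str],
                 current_label: str,
                 pass_rates: list[float],
                 majority_history: list[str]) -> str:
    """Return the (possibly corrected) label for one training sample,
    updating its per-sample histories in place."""
    # Majority vote over the current policy's rollout answers
    # (ties are broken arbitrarily in this sketch).
    majority = max(set(rollout_answers), key=rollout_answers.count)
    pass_rate = rollout_answers.count(majority) / len(rollout_answers)

    pass_rates.append(pass_rate)
    majority_history.append(majority)

    # Progressively correct the label only when both conditions hold.
    if majority != current_label and should_refine(pass_rates, majority_history):
        return majority
    return current_label
```

In this reading, a training loop would call `refine_label` once per suspected-noisy sample per policy update, so corrections accumulate only as the improving policy produces increasingly consistent majority answers.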

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is effective at training reasoning models when abundant, perfectly labeled data is available, but its vulnerability to the noisy labels that are unavoidable under expert scarcity remains critically underexplored. In this work, we take the first step toward a systematic analysis of noisy-label mechanisms in RLVR. In contrast to supervised classification, most RLVR algorithms incorporate a rollout-based condition: a label's influence on training is contingent on whether the current policy can generate rollouts that realize it, a property that naturally extends to noisy labels. Based on this observation, we distinguish two types of noise: inactive noisy labels, which mainly reduce data efficiency, and active noisy labels, which are reinforced by training and risk skewing the model toward incorrect reasoning distributions. From experiments on training with noisy samples, we identify an Early Correctness Coherence phenomenon: although noisy samples begin to lag behind in later stages, accuracy on both clean and noisy samples increases similarly in early training. Motivated by this dynamic, we propose Online Label Refinement (OLR), which progressively corrects potentially noisy labels with majority-voted answers when two conditions hold: a positive slope in the majority answer's rollout pass rate, and stable historical consistency across updates. This enables gradual self-correction as the policy improves. We evaluate OLR on six in-distribution mathematical reasoning benchmarks (AIME24/25, AMC, MATH-500, Minerva, and Olympiad) and three out-of-distribution tasks (ARC-c, GPQA-diamond, and MMLU-pro). Across noise ratios from 0.1 to 0.9, OLR consistently improves robustness under both inactive and active noisy-label settings, achieving average gains of 3.6% to 3.9% on in-distribution benchmarks and 3.3% to 4.6% on out-of-distribution evaluations.
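
The rollout-based condition that separates inactive from active noise can also be expressed in a few lines. The sketch below is a simplified reading rather than the paper's implementation: with a binary verifiable reward, a noisy label that no rollout matches yields all-zero rewards and thus no positive signal to reinforce, while a noisy label the policy can realize earns reward and gets reinforced.

```python
def classify_noisy_label(rollout_answers: list[str], label: str) -> str:
    """Illustrative (not the paper's code): with a binary verifiable
    reward, a label affects the policy's update only if at least one
    rollout realizes it. A noisy label no rollout matches produces
    all-zero rewards and mostly wastes the sample ("inactive"); one
    the policy can realize gets reinforced and can skew the reasoning
    distribution toward the wrong answer ("active")."""
    rewards = [1.0 if answer == label else 0.0 for answer in rollout_answers]
    return "active" if sum(rewards) > 0 else "inactive"
```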
