Can LLMs Learn to Reason Robustly under Noisy Supervision?
arXiv cs.LG / 4/7/2026
Key Points
- The paper studies how reasoning models trained with Reinforcement Learning with Verifiable Rewards (RLVR) behave when training signals include noisy labels, focusing on expert-scarce settings where noise is unavoidable.
- It distinguishes “inactive” noisy labels (which mainly reduce data efficiency) from “active” noisy labels (which can be reinforced by the rollout process and skew the model toward incorrect reasoning distributions).
- Experiments reveal an Early Correctness Coherence effect: during early training, accuracy improves similarly on clean and noisy samples, with noisy samples only falling behind later.
- Motivated by this dynamic, the authors propose Online Label Refinement (OLR), which progressively corrects suspected noisy labels with majority-voted rollout answers once pass-rate-trend and historical-consistency conditions are satisfied (a minimal sketch follows this list).
- Across multiple math and general reasoning benchmarks under noise ratios of 0.1–0.9, OLR improves robustness, yielding average gains of about 3.6–3.9% in-distribution and 3.3–4.6% out-of-distribution.
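
The summary does not spell out OLR's exact criteria, but the described mechanism, relabeling a sample with its majority-voted rollout answer once its pass-rate trend and voting history meet certain conditions, can be sketched as follows. Everything here is an illustrative assumption: the thresholds `PASS_RATE_FLOOR`, `HISTORY_LEN`, and `CONSISTENCY_FRAC` and the `SampleState` helper are hypothetical stand-ins, not values or APIs from the paper.

```python
from collections import Counter, deque

# Illustrative sketch of Online Label Refinement (OLR). The thresholds and
# the SampleState helper are assumptions for demonstration, not the paper's.

PASS_RATE_FLOOR = 0.2    # assumed: pass rates at or below this are "suspect"
HISTORY_LEN = 4          # assumed: epochs of per-sample history to track
CONSISTENCY_FRAC = 0.75  # assumed: agreement needed across history to relabel

class SampleState:
    """Per-sample bookkeeping across training epochs."""
    def __init__(self, label):
        self.label = label
        self.pass_rates = deque(maxlen=HISTORY_LEN)     # pass rate vs. current label
        self.voted_answers = deque(maxlen=HISTORY_LEN)  # majority rollout answer per epoch

    def update(self, rollout_answers):
        """Record one epoch of rollouts for this sample."""
        n_match = sum(a == self.label for a in rollout_answers)
        self.pass_rates.append(n_match / len(rollout_answers))
        self.voted_answers.append(Counter(rollout_answers).most_common(1)[0][0])

    def maybe_refine(self):
        """Relabel with the majority vote when the pass-rate trend stays low
        and the voted answer has been historically consistent."""
        if len(self.pass_rates) < HISTORY_LEN:
            return False  # not enough history yet
        low_trend = all(p <= PASS_RATE_FLOOR for p in self.pass_rates)
        candidate, count = Counter(self.voted_answers).most_common(1)[0]
        consistent = count / len(self.voted_answers) >= CONSISTENCY_FRAC
        if low_trend and consistent and candidate != self.label:
            self.label = candidate  # correct the suspected noisy label
            return True
        return False

# Usage: a sample labeled "42" whose rollouts keep converging on "7".
state = SampleState(label="42")
for _ in range(HISTORY_LEN):
    state.update(["7", "7", "7", "7", "42"])  # pass rate 0.2 each epoch
print(state.maybe_refine(), state.label)  # -> True 7
```

Gating on both a sustained low pass-rate trend and cross-epoch vote consistency, rather than a single epoch's vote, is what would keep a one-off bad rollout batch from overwriting a correct label.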