JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR

arXiv cs.AI / 4/29/2026


Key Points

  • The paper introduces JURY-RL, a label-free framework for reinforcement learning with verifiable rewards (RLVR) that reduces annotation and reward-specification costs for LLM reasoning.
  • JURY-RL separates “answer proposal” from “reward disposal”: a plurality vote over model rollouts proposes a candidate answer, and a formal verifier decides whether that candidate is eligible for positive reward.
  • When verification is inconclusive, JURY-RL falls back to ResZero (Residual-Zero), which discards the unverifiable consensus proposal and instead distributes a zero-mean, variance-preserving reward signal over the residual answers (see the sketch after this list).
  • Across three backbone models trained on mathematical data, JURY-RL shows consistent gains over other label-free baselines on mathematical reasoning benchmarks and transfers competitively to code generation and general benchmarks.
  • The approach achieves pass@1 comparable to supervised ground-truth training while improving generalization via higher pass@k and increased response diversity.
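
For intuition, here is a minimal Python sketch of the reward assignment described in the key points. The function name `jury_rl_rewards`, the `verify` stand-in for the Lean checker, the handling of a refuted candidate, and the particular zero-mean standardization in the fallback are all our own illustrative assumptions; the paper's exact ResZero scheme and Lean integration are not specified in this summary.

```python
from collections import Counter
from typing import Callable, List, Optional

def jury_rl_rewards(
    answers: List[str],
    verify: Callable[[str], Optional[bool]],
) -> List[float]:
    """Per-rollout rewards under the votes-propose / proofs-dispose rule.

    `verify` stands in for a Lean-backed checker: True means the candidate
    was formally verified, False means it was refuted, and None means the
    verifier was inconclusive (e.g., timeout or no applicable proof).
    """
    if not answers:
        return []

    # Votes propose: the plurality answer among rollouts is the candidate.
    counts = Counter(answers)
    candidate, _ = counts.most_common(1)[0]

    verdict = verify(candidate)

    if verdict is True:
        # Proofs dispose: only rollouts matching the verified plurality
        # answer receive positive reward.
        return [1.0 if a == candidate else 0.0 for a in answers]

    if verdict is False:
        # Refuted candidate: withhold reward entirely (our assumption; the
        # summary only specifies the verified and inconclusive cases).
        return [0.0] * len(answers)

    # ResZero fallback: discard the unverified plurality proposal and spread
    # a zero-mean signal over the residual (non-plurality) rollouts, keeping
    # the gradient informative without reinforcing unverifiable consensus.
    rewards = [0.0] * len(answers)
    residual = [i for i, a in enumerate(answers) if a != candidate]
    if not residual:
        return rewards  # unanimous but unverifiable: no learning signal

    # Illustrative zero-mean, unit-variance scheme (the paper's exact
    # variance-preserving rule is not given in this summary): score residual
    # rollouts by how often their answer recurs, then standardize.
    raw = [float(counts[answers[i]]) for i in residual]
    mean = sum(raw) / len(raw)
    std = (sum((r - mean) ** 2 for r in raw) / len(raw)) ** 0.5
    for i, r in zip(residual, raw):
        rewards[i] = (r - mean) / std if std > 0 else 0.0
    return rewards
```

For example, `jury_rl_rewards(["42", "42", "17", "42", "23"], lambda a: None)` triggers the ResZero branch: the unverifiable plurality answer "42" earns nothing, while the residual rollouts "17" and "23" receive rewards that sum to zero.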

Abstract

Reinforcement learning with verifiable rewards (RLVR) enhances the reasoning of large language models (LLMs), but standard RLVR often depends on human-annotated answers or carefully curated reward specifications. In machine-checkable domains, label-free alternatives such as majority voting or LLM-as-a-judge remove annotation cost but can introduce false positives that destabilize training. We introduce JURY-RL, a label-free RLVR framework that decouples answer proposal from reward disposal: votes from model rollouts propose a candidate answer, and a formal verifier determines whether that candidate can receive positive reward. Concretely, only rollouts matching the plurality-voted answer are rewarded when that answer is successfully verified in Lean. When verification is inconclusive, we invoke ResZero (Residual-Zero), a fallback reward that discards the unverified plurality proposal and redistributes a zero-mean, variance-preserving signal over the residual answers. This design maintains a stable optimization gradient without reinforcing unverifiable consensus. Across three backbone models trained on mathematical data, JURY-RL consistently outperforms other label-free baselines on mathematical reasoning benchmarks and transfers competitively to code generation and general benchmarks. It attains pass@1 performance comparable to supervised ground-truth training, with superior generalization demonstrated by higher pass@k and response diversity.
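
In our own notation (introduced here for readability; the paper may define this differently), the reward for rollout $i$ with final answer $a_i$, given the plurality candidate $\hat{a} = \arg\max_{a} \sum_{j} \mathbf{1}[a_j = a]$, can be sketched as

$$
r_i =
\begin{cases}
\mathbf{1}[a_i = \hat{a}], & \mathrm{Verify}(\hat{a}) = \text{proved},\\
\tilde{r}_i, & \mathrm{Verify}(\hat{a}) = \text{inconclusive},
\end{cases}
\qquad
\text{with } \tilde{r}_i = 0 \text{ for } a_i = \hat{a}
\text{ and } \sum_{i:\, a_i \neq \hat{a}} \tilde{r}_i = 0 .
$$

The zero-mean constraint on $\tilde{r}$ captures why the fallback neither reinforces the unverifiable consensus nor collapses the advantage estimates to a constant, which is the stability property the abstract claims for ResZero.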