Dual Consensus: Escaping from Spurious Majority in Unsupervised RLVR via Two-Stage Vote Mechanism

arXiv cs.LG / 3/18/2026

📰 NewsIdeas & Deep AnalysisModels & Research

共有:

Key Points

Dual Consensus Reinforcement Learning (DCRL) is proposed as a self-supervised training method to mitigate convergence on spurious majority in unsupervised RLVR for large language models.
It introduces a two-stage vote mechanism where the model first acts as an anchor to produce dominant responses and then as an explorer to generate diverse auxiliary signals via a temporary unlearning process.
The final training target is the harmonic mean of the anchor and explorer signals, and the approach operates without external models or supervision.
Across eight benchmarks, DCRL improves Pass@1 over majority vote and yields more stable training dynamics, indicating a scalable path for stronger reasoning without labeled data.

Abstract

Current label-free RLVR approaches for large language models (LLMs), such as TTRL and Self-reward, have demonstrated effectiveness in improving the performance of LLMs on complex reasoning tasks. However, these methods rely heavily on accurate pseudo-label estimation and converge on spurious yet popular answers, thereby trapping in a dominant mode and limiting further improvements. Building on this, we propose Dual Consensus Reinforcement Learning (DCRL), a novel self-supervised training method which is capable of generating more reliable learning signals through a two-stage consensus mechanism. The model initially acts as an anchor, producing dominant responses; then it serves as an explorer, generating diverse auxiliary signals via a temporary unlearning process. The final training target is derived from the harmonic mean of these two signal sets. Notably, the process operates entirely without external models or supervision. Across eight benchmarks and diverse domains, DCRL consistently improves Pass@1 over majority vote while yielding more stable training dynamics. These results demonstrate that DCRL establishes a scalable path toward stronger reasoning without labels.

[R] Combining Identity Anchors + Permission Hierarchies achieves 100% refusal in abliterated LLMs — system prompt only, no fine-tuning

Reddit r/MachineLearning

How I Built an AI SDR Agent That Finds Leads and Writes Personalized Cold Emails

Dev.to

Complete Guide: How To Make Money With Ai

Dev.to

I Analyzed My Portfolio with AI and Scored 53/100 — Here's How I Fixed It to 85+

Dev.to

The Demethylation

Dev.to

Dual Consensus: Escaping from Spurious Majority in Unsupervised RLVR via Two-Stage Vote Mechanism

Key Points

Abstract

Related Articles

[R] Combining Identity Anchors + Permission Hierarchies achieves 100% refusal in abliterated LLMs — system prompt only, no fine-tuning

How I Built an AI SDR Agent That Finds Leads and Writes Personalized Cold Emails

Complete Guide: How To Make Money With Ai

I Analyzed My Portfolio with AI and Scored 53/100 — Here's How I Fixed It to 85+

The Demethylation

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer