AI Navigate

Dual Consensus: Escaping from Spurious Majority in Unsupervised RLVR via Two-Stage Vote Mechanism

arXiv cs.LG / 3/18/2026


Key Points

  • Dual Consensus Reinforcement Learning (DCRL) is proposed as a self-supervised training method to mitigate convergence on spurious majority in unsupervised RLVR for large language models.
  • It introduces a two-stage vote mechanism where the model first acts as an anchor to produce dominant responses and then as an explorer to generate diverse auxiliary signals via a temporary unlearning process.
  • The final training target is the harmonic mean of the anchor and explorer signals, and the approach operates without external models or supervision.
  • Across eight benchmarks, DCRL improves Pass@1 over majority vote and yields more stable training dynamics, indicating a scalable path for stronger reasoning without labeled data.
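The two-stage vote described above can be sketched in a few lines. The following is a hypothetical illustration, not the paper's implementation: we assume the anchor and explorer each contribute per-answer vote shares, combine them with a per-answer harmonic mean (which is near zero unless both signal sets support the answer), and take the argmax as the pseudo-label training target.

```python
from collections import Counter

def harmonic_mean(a, b, eps=1e-8):
    # Harmonic mean rewards answers supported by BOTH signal sets;
    # it collapses toward zero if either vote share is near zero.
    return 2 * a * b / (a + b + eps)

def dual_consensus_target(anchor_answers, explorer_answers):
    """Hypothetical sketch of the two-stage vote: combine the anchor's
    dominant vote shares with the explorer's diverse vote shares via a
    per-answer harmonic mean, then take the argmax as the pseudo-label."""
    n_a, n_e = len(anchor_answers), len(explorer_answers)
    anchor_share = {k: v / n_a for k, v in Counter(anchor_answers).items()}
    explorer_share = {k: v / n_e for k, v in Counter(explorer_answers).items()}
    candidates = set(anchor_share) | set(explorer_share)
    scores = {
        ans: harmonic_mean(anchor_share.get(ans, 0.0),
                           explorer_share.get(ans, 0.0))
        for ans in candidates
    }
    return max(scores, key=scores.get)

# A spurious majority in the anchor votes ("7") is overruled when the
# explorer's diverse samples give it no support, while "12" is backed by both.
anchor = ["7", "7", "7", "12", "12"]
explorer = ["12", "12", "9", "12", "5"]
print(dual_consensus_target(anchor, explorer))  # → 12
```

Under this reading, a plain majority vote over the anchor samples alone would select "7", whereas the dual consensus selects "12" because only it receives support from both signal sets.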

Abstract

Current label-free RLVR approaches for large language models (LLMs), such as TTRL and Self-reward, have demonstrated effectiveness in improving the performance of LLMs on complex reasoning tasks. However, these methods rely heavily on accurate pseudo-label estimation and converge on spurious yet popular answers, becoming trapped in a dominant mode that limits further improvement. To address this, we propose Dual Consensus Reinforcement Learning (DCRL), a novel self-supervised training method that generates more reliable learning signals through a two-stage consensus mechanism. The model initially acts as an anchor, producing dominant responses; it then serves as an explorer, generating diverse auxiliary signals via a temporary unlearning process. The final training target is derived from the harmonic mean of these two signal sets. Notably, the process operates entirely without external models or supervision. Across eight benchmarks and diverse domains, DCRL consistently improves Pass@1 over majority vote while yielding more stable training dynamics. These results demonstrate that DCRL establishes a scalable path toward stronger reasoning without labels.