PubSwap: Public-Data Off-Policy Coordination for Federated RLVR

arXiv cs.LG · April 15, 2026


Key Points

  • The paper introduces PubSwap, a federated RLVR (reinforcement learning from verifiable rewards) framework aimed at scaling reasoning post-training beyond centralized settings.
  • It reduces communication cost and client drift by combining LoRA-based local adaptation with periodic off-policy coordination on a small shared public dataset.
  • During public-data steps, PubSwap selectively replaces locally incorrect responses with globally correct ones, using shared response-level signals to keep clients better aligned with a global objective.
  • The authors report consistent improvements over standard baselines on mathematical and medical reasoning benchmarks, suggesting the approach is broadly effective.
  • Overall, the work presents a practical “recipe” for federated reasoning post-training that combines low-rank updates with lightweight public-data anchoring without exposing private data.
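The low-rank communication mentioned above can be illustrated with a minimal sketch. Each client ships only its LoRA factors (A, B) rather than full weight matrices, and the server aggregates the implied weight deltas. The function name and the plain-averaging rule here are illustrative assumptions, not the paper's exact aggregation scheme:

```python
import numpy as np

def aggregate_lora(client_factors):
    """Average clients' LoRA weight deltas on the server.

    client_factors: list of (A, B) pairs per client, with
      A: (r, d_in) and B: (d_out, r), so each delta is B @ A.
    Averaging the full deltas (rather than A and B separately)
    avoids the bias of mixing mismatched low-rank bases.
    This is an illustrative sketch, not the paper's implementation.
    """
    deltas = [B @ A for A, B in client_factors]
    return sum(deltas) / len(deltas)  # full-rank averaged update
```

Since only the small factors A and B travel over the network, communication scales with the rank r instead of the full weight dimensions, which is the efficiency argument the key point makes.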

Abstract

Reasoning post-training with reinforcement learning from verifiable rewards (RLVR) is typically studied in centralized settings, yet many realistic applications involve decentralized private data distributed across organizations. Federated training is a natural solution, but scaling RLVR in this regime is challenging: full-model synchronization is expensive, and performing many local steps can cause severe client drift under heterogeneous data. We propose a federated RLVR framework that combines LoRA-based local adaptation with public-data-based off-policy steps to improve both communication efficiency and cross-client coordination. In particular, a small shared public dataset is used to periodically exchange and reuse response-level training signals across organizations, providing a lightweight anchor toward a more globally aligned objective without exposing private data. Our method selectively replaces locally incorrect responses with globally correct ones during public-data steps, thereby keeping training closer to the local policy while still benefiting from cross-client coordination. Across mathematical and medical reasoning benchmarks and models, our method consistently improves over standard baselines. Our results highlight a simple and effective recipe for federated reasoning post-training: combining low-rank communication with limited public-data coordination.
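The selective-replacement rule described in the abstract can be sketched as follows. During a public-data step, a client keeps any of its own responses that the verifier marks correct (staying close to its local policy) and swaps in a globally shared correct response only for prompts it got wrong. Function and variable names are illustrative assumptions, not the authors' code:

```python
def pubswap_merge(local, shared, verify):
    """Build the training batch for one public-data off-policy step.

    local:  dict prompt -> this client's sampled response
    shared: dict prompt -> a globally shared response for that prompt
    verify: callable(prompt, response) -> bool, the verifiable-reward check

    Keeps locally correct (on-policy) responses; substitutes a shared
    response only when the local one is wrong and the shared one verifies.
    Illustrative sketch under the assumptions above.
    """
    merged = {}
    for prompt, response in local.items():
        if verify(prompt, response):
            merged[prompt] = response          # keep on-policy sample
        elif prompt in shared and verify(prompt, shared[prompt]):
            merged[prompt] = shared[prompt]    # off-policy correct swap-in
        else:
            merged[prompt] = response          # no correct alternative; keep local
    return merged
```

Keeping verified local responses means most of the batch stays on-policy, which matches the abstract's claim that training remains close to the local policy while still gaining cross-client coordination from the swapped-in responses.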