PubSwap: Public-Data Off-Policy Coordination for Federated RLVR

arXiv cs.LG · April 15, 2026


Key Points

  • The paper introduces PubSwap, a federated RLVR (reinforcement learning from verifiable rewards) framework aimed at scaling reasoning post-training beyond centralized settings.
  • It reduces communication cost and client drift by combining LoRA-based local adaptation with periodic off-policy coordination on a small shared public dataset.
  • During public-data steps, PubSwap selectively replaces locally incorrect responses with globally correct ones, using shared response-level signals to keep clients better aligned with a global objective.
  • The authors report consistent improvements over standard baselines on mathematical and medical reasoning benchmarks, suggesting the approach is broadly effective.
  • Overall, the work presents a practical “recipe” for federated reasoning post-training that combines low-rank updates with lightweight public-data anchoring without exposing private data.
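The low-rank communication mentioned above can be illustrated with a minimal sketch. Each client ships only its LoRA factors (A, B) rather than full weight matrices, and the server aggregates the implied weight deltas. The function name and the plain-averaging rule here are illustrative assumptions, not the paper's exact aggregation scheme:

```python
import numpy as np

def aggregate_lora(client_factors):
    """Average clients' LoRA weight deltas on the server.

    client_factors: list of (A, B) pairs per client, with
      A: (r, d_in) and B: (d_out, r), so each delta is B @ A.
    Averaging the full deltas (rather than A and B separately)
    avoids the bias of mixing mismatched low-rank bases.
    This is an illustrative sketch, not the paper's implementation.
    """
    deltas = [B @ A for A, B in client_factors]
    return sum(deltas) / len(deltas)  # full-rank averaged update
```

Since only the small factors A and B travel over the network, communication scales with the rank r instead of the full weight dimensions, which is the efficiency argument the key point makes.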

Abstract

Reasoning post-training with reinforcement learning from verifiable rewards (RLVR) is typically studied in centralized settings, yet many realistic applications involve decentralized private data distributed across organizations. Federated training is a natural solution, but scaling RLVR in this regime is challenging: full-model synchronization is expensive, and performing many local steps can cause severe client drift under heterogeneous data. We propose a federated RLVR framework that combines LoRA-based local adaptation with public-data-based off-policy steps to improve both communication efficiency and cross-client coordination. In particular, a small shared public dataset is used to periodically exchange and reuse response-level training signals across organizations, providing a lightweight anchor toward a more globally aligned objective without exposing private data. Our method selectively replaces locally incorrect responses with globally correct ones during public-data steps, thereby keeping training closer to the local policy while still benefiting from cross-client coordination. Across mathematical and medical reasoning benchmarks and models, our method consistently improves over standard baselines. Our results highlight a simple and effective recipe for federated reasoning post-training: combining low-rank communication with limited public-data coordination.
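The selective-replacement rule described in the abstract can be sketched as follows. During a public-data step, a client keeps any of its own responses that the verifier marks correct (staying close to its local policy) and swaps in a globally shared correct response only for prompts it got wrong. Function and variable names are illustrative assumptions, not the authors' code:

```python
def pubswap_merge(local, shared, verify):
    """Build the training batch for one public-data off-policy step.

    local:  dict prompt -> this client's sampled response
    shared: dict prompt -> a globally shared response for that prompt
    verify: callable(prompt, response) -> bool, the verifiable-reward check

    Keeps locally correct (on-policy) responses; substitutes a shared
    response only when the local one is wrong and the shared one verifies.
    Illustrative sketch under the assumptions above.
    """
    merged = {}
    for prompt, response in local.items():
        if verify(prompt, response):
            merged[prompt] = response          # keep on-policy sample
        elif prompt in shared and verify(prompt, shared[prompt]):
            merged[prompt] = shared[prompt]    # off-policy correct swap-in
        else:
            merged[prompt] = response          # no correct alternative; keep local
    return merged
```

Keeping verified local responses means most of the batch stays on-policy, which matches the abstract's claim that training remains close to the local policy while still gaining cross-client coordination from the swapped-in responses.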