SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology

arXiv cs.AI / 3/31/2026


Key Points

  • Reinforcement learning for large reasoning models is often constrained by the need for verifiable rewards or labeled supervision, which limits performance in open-ended domains with ambiguous correctness.
  • The paper proposes SARL (Structure-Aware Reinforcement Learning), a label-free RL framework that builds a per-response Reasoning Map from intermediate thinking steps and rewards its "small-world" topology, shifting supervision from the final answer toward the reasoning path.
  • SARL aims to produce reasoning trajectories that are locally coherent while also being globally efficient, improving general reasoning rather than optimizing for early exploitation.
  • Experiments on Qwen3-4B report that SARL outperforms ground-truth-based RL and prior label-free RL baselines, with sizable gains on both math tasks and open-ended tasks.
  • The results also indicate training stability and improved exploration/generalization, evidenced by lower KL divergence and higher policy entropy compared with baselines.

Abstract

Reinforcement learning has become central to improving large reasoning models, but its success still relies heavily on verifiable rewards or labeled supervision. This limits its applicability to open-ended domains where correctness is ambiguous and cannot be verified. Moreover, reasoning trajectories remain largely unconstrained, and optimization toward the final answer can favor early exploitation over generalization. In this work, we ask whether general reasoning ability can be improved by teaching models how to think (the structure of reasoning) rather than what to produce (the outcome of reasoning), extending traditional RLVR to open-ended settings. We introduce structure-aware reinforcement learning (SARL), a label-free framework that constructs a per-response Reasoning Map from intermediate thinking steps and rewards its small-world topology, inspired by complex networks and the functional organization of the human brain. SARL encourages reasoning trajectories that are both locally coherent and globally efficient, shifting supervision from destination to path. Our experiments on Qwen3-4B show SARL surpasses ground-truth-based RL and prior label-free RL baselines, achieving the best average gain of 9.1% under PPO and 11.6% under GRPO on math tasks, and 34.6% under PPO and 30.4% under GRPO on open-ended tasks. Beyond strong performance, SARL also exhibits lower KL divergence and higher policy entropy, indicating more stable, exploratory training and generalized reasoning ability.
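The abstract does not spell out how the small-world reward is computed, but the intuition (high local clustering for coherence, short average paths for global efficiency) can be sketched on a toy undirected Reasoning Map. Everything below is illustrative: the graph, the function names, and the `alpha`/`beta` weighting are assumptions for exposition, not SARL's actual construction or reward.

```python
from collections import deque

def clustering_coefficient(adj):
    """Average local clustering over nodes with degree >= 2.

    adj: dict mapping node -> set of neighbor nodes (undirected graph).
    """
    coeffs = []
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue
        # count edges among the neighbors of v (each pair once)
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        coeffs.append(2 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs) if coeffs else 0.0

def avg_path_length(adj):
    """Mean shortest-path length over reachable ordered pairs (BFS per node)."""
    total, pairs = 0, 0
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        total += sum(d for n, d in dist.items() if n != src)
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0

def small_world_reward(adj, alpha=1.0, beta=1.0):
    """Hypothetical scalar reward: reward clustering (local coherence),
    penalize path length in excess of 1 (global inefficiency)."""
    return alpha * clustering_coefficient(adj) - beta * (avg_path_length(adj) - 1.0)

# Toy Reasoning Map: five reasoning steps, mostly chained, with
# shortcut edges that create both triangles and short global paths.
reasoning_map = {
    0: {1, 2, 4},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2, 4},
    4: {0, 3},
}
print(small_world_reward(reasoning_map))
```

A graph that is locally clustered but globally stretched (a long chain of cliques) would score lower than one with a few long-range shortcut edges, which is the small-world trade-off the reward is meant to capture.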