SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

arXiv cs.CL / 4/13/2026


Key Points

  • The paper proposes SUPERNOVA, a data curation framework that extends Reinforcement Learning with Verifiable Rewards (RLVR) from formal reasoning (math/code) to more general reasoning involving causal and temporal understanding.
  • It argues that the main bottleneck for general RLVR is scarce high-quality, verifiable training data, and introduces an approach to adapt expert-annotated instruction-tuning datasets into RLVR-ready training signals.
  • Across 100+ controlled reinforcement learning experiments, the authors analyze how data design choices—source task selection, task mixing strategies, and synthetic interventions—affect downstream reasoning performance.
  • Results show that source task selection is crucial, and selecting tasks based on performance on the specific target task beats approaches that rely on overall average performance.
  • Models trained with SUPERNOVA outperform strong baselines (e.g., Qwen3.5) on benchmarks such as BBEH, Zebralogic, and MMLU-Pro, with relative improvements of up to 52.8% on BBEH across model sizes. The code and data are released on GitHub.
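
The contrast between selecting source tasks per target and selecting them by overall average can be illustrated with a small sketch. Everything here is hypothetical: the task names, benchmark names, and scores are invented for illustration and are not taken from the paper, and the functions are not the authors' selection procedure.

```python
from statistics import mean

# Hypothetical table: scores[source_task][target_benchmark] is the accuracy
# reached after RL training on that source task alone. Invented numbers.
scores = {
    "causal_qa":      {"bench_a": 0.30, "bench_b": 0.10},
    "temporal_order": {"bench_a": 0.12, "bench_b": 0.28},
    "generic_mix":    {"bench_a": 0.22, "bench_b": 0.22},
}

def top_k_by_average(scores, k):
    """Rank source tasks by their mean score over all target benchmarks."""
    ranked = sorted(scores, key=lambda s: mean(scores[s].values()), reverse=True)
    return ranked[:k]

def top_k_per_target(scores, target, k):
    """Rank source tasks by their score on one specific target benchmark."""
    ranked = sorted(scores, key=lambda s: scores[s][target], reverse=True)
    return ranked[:k]

if __name__ == "__main__":
    print("average-based:", top_k_by_average(scores, 1))
    for target in ("bench_a", "bench_b"):
        print(target, "->", top_k_per_target(scores, target, 1))
```

In this toy table the average-based ranking picks the same middling source task for every benchmark, while the per-target ranking picks a different specialist for each one, which is the kind of gap the key point describes.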

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved large language model (LLM) reasoning in formal domains such as mathematics and code. Despite these advancements, LLMs still struggle with general reasoning tasks requiring capabilities such as causal inference and temporal understanding. Extending RLVR to general reasoning is fundamentally constrained by the lack of high-quality, verifiable training data that spans diverse reasoning skills. To address this challenge, we propose SUPERNOVA, a data curation framework for RLVR aimed at enhancing general reasoning. Our key insight is that instruction-tuning datasets containing expert-annotated ground truth encode rich reasoning patterns that can be systematically adapted for RLVR. To study this, we conduct 100+ controlled RL experiments to analyze how data design choices impact downstream reasoning performance. In particular, we investigate three key factors: (i) source task selection, (ii) task mixing strategies, and (iii) synthetic interventions for improving data quality. Our analysis reveals that source task selection is non-trivial and has a significant impact on downstream reasoning performance. Moreover, selecting tasks based on their performance on individual target tasks outperforms strategies based on overall average performance. Finally, models trained on SUPERNOVA outperform strong baselines (e.g., Qwen3.5) on challenging reasoning benchmarks including BBEH, Zebralogic, and MMLU-Pro. In particular, training on SUPERNOVA yields relative improvements of up to 52.8% on BBEH across model sizes, demonstrating the effectiveness of principled data curation for RLVR. Our findings provide practical insights for curating human-annotated resources to extend RLVR to general reasoning. The code and data are available at https://github.com/asuvarna31/supernova.
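
To make the idea of adapting expert-annotated ground truth into a verifiable reward concrete, here is a minimal sketch of an exact-match reward. It is an assumption-laden illustration, not the SUPERNOVA implementation: the "Answer:" marker convention, the normalization rules, and all function names are hypothetical.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def extract_final_answer(response: str) -> str:
    """Assumed convention: the answer follows the last 'Answer:' marker.

    If no marker is present, the whole response is treated as the answer.
    """
    marker = "answer:"
    idx = response.lower().rfind(marker)
    return response[idx + len(marker):] if idx != -1 else response

def verifiable_reward(response: str, ground_truth: str) -> float:
    """Return 1.0 if the extracted answer matches the annotation, else 0.0."""
    predicted = normalize(extract_final_answer(response))
    return 1.0 if predicted == normalize(ground_truth) else 0.0

if __name__ == "__main__":
    reply = "The alarm rang, then she woke up. Answer: Before."
    print(verifiable_reward(reply, "before"))
```

A binary exact-match check like this only works when the annotated answer is short and unambiguous; free-form annotations would need a different verifier, which this sketch does not attempt.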