SFT-GRPO Data Overlap as a Post-Training Hyperparameter for Autoformalization

arXiv cs.LG / April 16, 2026


Key Points

  • The paper presents a controlled ablation study on how the overlap between supervised fine-tuning (SFT) data and Group Relative Policy Optimization (GRPO) prompts affects post-training performance for Lean 4 autoformalization.
  • Experiments on Qwen3-8B (with thinking disabled) compare base, SFT-only, GRPO-only, and SFT+GRPO setups with 0%, 30%, or 100% GRPO prompt overlap with the SFT corpus, keeping compute cost constant.
  • Results show that keeping SFT and GRPO data disjoint consistently outperforms full overlap at zero additional compute, with performance improving monotonically as overlap decreases.
  • On Gaokao-Formal, 0% overlap produces a 10.4 percentage-point semantic gain from GRPO over SFT alone, while at 100% overlap both compilation and semantic metrics flatten, making GRPO effectively redundant.
  • The study finds that dual-metric evaluation uncovers large compile-vs-semantic gaps (over 30 percentage points) that compile-only benchmarking would miss, positioning SFT-GRPO overlap as a meaningful post-training hyperparameter.
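The compile and semantic pass@k numbers behind these gaps are commonly computed with the standard unbiased estimator from the code-generation literature (Chen et al., 2021). A minimal sketch, assuming that estimator is the one used (the paper's exact sample counts n and k are not given here):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n attempts of which c succeed,
    is a success. Used here for both compile and semantic success."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative only: 16 formalization attempts, 4 compile, estimate compile pass@4.
compile_pass_at_4 = pass_at_k(n=16, c=4, k=4)
```

A compile-vs-semantic gap is then simply `pass_at_k(n, c_compile, k) - pass_at_k(n, c_semantic, k)`, where semantic successes are the subset of compiling outputs the LLM judge also accepts.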

Abstract

Supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO) is a common post-training recipe. We conduct a controlled ablation over SFT-GRPO data overlap, evaluating Qwen3-8B (thinking disabled) post-trained for Lean 4 autoformalization under six conditions that differ solely in training recipe: a base model, SFT-only, GRPO-only, and three SFT+GRPO configurations where 0%, 30%, or 100% of the GRPO prompts coincide with the SFT corpus. Keeping SFT and GRPO data disjoint consistently outperforms full overlap at zero additional compute cost. Evaluating on Gaokao-Formal and PutnamBench under both compile pass@k and semantic pass@k assessed by an LLM judge, we find that lower overlap is monotonically associated with higher compilation and semantic accuracy. At 0% overlap, GRPO yields a 10.4 percentage-point semantic gain over SFT alone on Gaokao, while at 100% overlap both metrics remain flat, rendering the GRPO stage effectively redundant. We further show that dual-metric evaluation reveals compile-semantic gaps exceeding 30 percentage points for the highest-compiling models, a disparity invisible under compile-only benchmarking. To our knowledge, this is the first controlled investigation of SFT-GRPO data overlap as a post-training hyperparameter, demonstrating how model behavior varies with the degree of data sharing between training stages.
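The single varied factor in the six conditions is how many GRPO prompts are drawn from the SFT corpus versus a disjoint pool. A hypothetical sketch of such a controlled split (function and pool names are illustrative, not the paper's):

```python
import random

def build_grpo_prompts(sft_prompts, disjoint_pool, overlap_frac, n_grpo, seed=0):
    """Assemble a GRPO prompt set of fixed size n_grpo in which a fraction
    overlap_frac of prompts is reused from the SFT corpus and the remainder
    comes from a pool disjoint from SFT, so total GRPO compute is constant
    across the 0%, 30%, and 100% overlap conditions."""
    rng = random.Random(seed)
    n_shared = round(overlap_frac * n_grpo)
    shared = rng.sample(sft_prompts, n_shared)          # overlapping prompts
    fresh = rng.sample(disjoint_pool, n_grpo - n_shared)  # unseen prompts
    return shared + fresh

# Illustrative: 30% of 50 GRPO prompts reuse SFT data.
prompts = build_grpo_prompts(list(range(100)), list(range(100, 300)),
                             overlap_frac=0.30, n_grpo=50)
```

Holding `n_grpo` fixed while varying only `overlap_frac` is what lets the ablation attribute performance differences to overlap rather than to extra training compute.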
