SFT-GRPO Data Overlap as a Post-Training Hyperparameter for Autoformalization

arXiv cs.LG / April 16, 2026


Key Points

  • The paper presents a controlled ablation study on how the overlap between supervised fine-tuning (SFT) data and Group Relative Policy Optimization (GRPO) prompts affects post-training performance for Lean 4 autoformalization.
  • Experiments on Qwen3-8B (with thinking disabled) compare base, SFT-only, GRPO-only, and SFT+GRPO setups with 0%, 30%, or 100% GRPO prompt overlap with the SFT corpus, keeping compute cost constant.
  • Results show that keeping SFT and GRPO data disjoint consistently outperforms full overlap at zero additional compute, with performance improving monotonically as overlap decreases.
  • On Gaokao-Formal, 0% overlap produces a 10.4 percentage-point semantic gain from GRPO over SFT alone, while at 100% overlap both compilation and semantic metrics flatten, making GRPO effectively redundant.
  • The study finds that dual-metric evaluation uncovers large compile-vs-semantic gaps (over 30 percentage points) that compile-only benchmarking would miss, positioning SFT-GRPO overlap as a meaningful post-training hyperparameter.
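The compile and semantic pass@k numbers behind these gaps are commonly computed with the standard unbiased estimator from the code-generation literature (Chen et al., 2021). A minimal sketch, assuming that estimator is the one used (the paper's exact sample counts n and k are not given here):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n attempts of which c succeed,
    is a success. Used here for both compile and semantic success."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative only: 16 formalization attempts, 4 compile, estimate compile pass@4.
compile_pass_at_4 = pass_at_k(n=16, c=4, k=4)
```

A compile-vs-semantic gap is then simply `pass_at_k(n, c_compile, k) - pass_at_k(n, c_semantic, k)`, where semantic successes are the subset of compiling outputs the LLM judge also accepts.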

Abstract

Supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO) is a common post-training recipe. We conduct a controlled ablation over SFT-GRPO data overlap, evaluating Qwen3-8B (thinking disabled) post-trained for Lean 4 autoformalization under six conditions that differ solely in training recipe: a base model, SFT-only, GRPO-only, and three SFT+GRPO configurations where 0%, 30%, or 100% of the GRPO prompts coincide with the SFT corpus. Keeping SFT and GRPO data disjoint consistently outperforms full overlap at zero additional compute cost. Evaluating on Gaokao-Formal and PutnamBench under both compile pass@k and semantic pass@k assessed by an LLM judge, we find that lower overlap is monotonically associated with higher compilation and semantic accuracy. At 0% overlap, GRPO yields a 10.4 percentage-point semantic gain over SFT alone on Gaokao, while at 100% overlap both metrics remain flat, rendering the GRPO stage effectively redundant. We further show that dual-metric evaluation reveals compile-semantic gaps exceeding 30 percentage points for the highest-compiling models, a disparity invisible under compile-only benchmarking. To our knowledge, this is the first controlled investigation of SFT-GRPO data overlap as a post-training hyperparameter, demonstrating how model behavior varies with the degree of data sharing between training stages.
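The single varied factor in the six conditions is how many GRPO prompts are drawn from the SFT corpus versus a disjoint pool. A hypothetical sketch of such a controlled split (function and pool names are illustrative, not the paper's):

```python
import random

def build_grpo_prompts(sft_prompts, disjoint_pool, overlap_frac, n_grpo, seed=0):
    """Assemble a GRPO prompt set of fixed size n_grpo in which a fraction
    overlap_frac of prompts is reused from the SFT corpus and the remainder
    comes from a pool disjoint from SFT, so total GRPO compute is constant
    across the 0%, 30%, and 100% overlap conditions."""
    rng = random.Random(seed)
    n_shared = round(overlap_frac * n_grpo)
    shared = rng.sample(sft_prompts, n_shared)          # overlapping prompts
    fresh = rng.sample(disjoint_pool, n_grpo - n_shared)  # unseen prompts
    return shared + fresh

# Illustrative: 30% of 50 GRPO prompts reuse SFT data.
prompts = build_grpo_prompts(list(range(100)), list(range(100, 300)),
                             overlap_frac=0.30, n_grpo=50)
```

Holding `n_grpo` fixed while varying only `overlap_frac` is what lets the ablation attribute performance differences to overlap rather than to extra training compute.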
