Decouple before Integration: Test-time Synthesis of SFT and RLVR Task Vectors

arXiv cs.LG / 5/4/2026


Key Points

  • The paper studies why directly integrating SFT and RLVR for LLM post-training is difficult, attributing failures to large magnitude disparity, sign interference, and heterogeneous module-wise update distributions in task vectors.
  • It proposes Decoupled Test-time Synthesis (DoTS), a post-hoc method that trains SFT and RLVR checkpoints independently and combines their capabilities only at inference via task-vector arithmetic, without updating model parameters (a minimal sketch follows this list).
  • DoTS reduces interference via selective sparsification with norm-preserving rescaling, then uses Bayesian optimization over a small set of unlabeled queries to find combination coefficients that balance consistency and perplexity.
  • Experiments show DoTS matches or exceeds training-based SFT–RLVR integration approaches on multiple mathematical reasoning benchmarks at about 3% of the compute cost; applied to stronger post-trained checkpoints, it surpasses state-of-the-art models and generalizes out of domain without retuning.

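Below is a minimal PyTorch sketch of the synthesis step, assuming state dicts for the base, SFT, and RLVR checkpoints. Task vectors are the weight deltas relative to the shared base; each is sparsified, rescaled back to its original norm, and merged with coefficients (alpha, beta). The function names, the magnitude-based selection rule, and the keep fraction are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def task_vector(base: dict, finetuned: dict) -> dict:
    """Task vector: element-wise difference between fine-tuned and base weights."""
    return {name: finetuned[name] - base[name] for name in base}

def sparsify_rescale(tv: dict, keep_frac: float = 0.2) -> dict:
    """Zero out all but the largest-magnitude entries of each tensor, then
    rescale so the tensor's original L2 norm is preserved. The magnitude
    criterion and keep_frac=0.2 are illustrative choices, not the paper's."""
    out = {}
    for name, t in tv.items():
        flat = t.abs().flatten()
        k = max(1, int(keep_frac * flat.numel()))
        threshold = flat.topk(k).values.min()       # k-th largest magnitude
        pruned = t * (t.abs() >= threshold)         # selective sparsification
        scale = t.norm() / (pruned.norm() + 1e-12)  # norm-preserving rescaling
        out[name] = pruned * scale
    return out

def synthesize(base: dict, tv_sft: dict, tv_rlvr: dict,
               alpha: float, beta: float) -> dict:
    """Test-time synthesis: merge both task vectors into the base weights
    with coefficients (alpha, beta); no parameter is ever retrained."""
    return {name: base[name] + alpha * tv_sft[name] + beta * tv_rlvr[name]
            for name in base}
```

Because the merge is pure weight arithmetic, the SFT and RLVR checkpoints can be produced independently and recombined at any time, which is what makes the approach post-hoc.
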
Abstract

SFT and RLVR represent two fundamental yet distinct paradigms for LLM post-training, each excelling along a different dimension: SFT expands knowledge breadth while RLVR enhances reasoning depth. Yet integrating these complementary strengths remains a formidable challenge. Sequential training can cause catastrophic forgetting, and joint optimization often suffers from severe gradient conflicts. We analyze SFT and RLVR through the lens of task vectors and reveal three structural properties behind these failures: a 30× magnitude disparity, 45% sign interference, and heterogeneous module-wise update distributions. These findings show that SFT and RLVR are difficult to integrate directly, but they also suggest that the two paradigms modify partly complementary components of the model. Motivated by these observations, we propose Decoupled Test-time Synthesis (DoTS), a post-hoc framework that allows SFT and RLVR checkpoints to be trained independently and synthesizes their capabilities only at inference time via task-vector arithmetic, without updating model parameters. To reduce interference, DoTS applies selective sparsification with norm-preserving rescaling. It then uses Bayesian optimization on a small set of unlabeled queries to search for combination coefficients on the Pareto frontier of consistency and perplexity. Empirically, DoTS matches or exceeds the performance of training-based SFT–RLVR integration methods across multiple mathematical reasoning benchmarks while incurring only about 3% of the computational cost. When applied to stronger post-trained checkpoints, DoTS surpasses SOTA models and generalizes to out-of-domain benchmarks without re-tuning. Code is available at https://github.com/chaohaoyuan/DoTS.
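
A hedged sketch of the coefficient search follows, using scikit-optimize's gp_minimize as the Bayesian optimizer (the paper does not specify this library). The paper searches the Pareto frontier of consistency and perplexity; this sketch scalarizes the two criteria into a single objective, and the toy scorers stand in for measurements taken with the synthesized model on unlabeled queries.

```python
import math
from skopt import gp_minimize  # Bayesian optimization from scikit-optimize

def consistency(alpha: float, beta: float) -> float:
    # Toy stand-in: in practice, sample several generations per unlabeled
    # query from the synthesized model and measure their agreement rate.
    return math.exp(-((alpha - 0.8) ** 2 + (beta - 1.1) ** 2))

def perplexity(alpha: float, beta: float) -> float:
    # Toy stand-in: in practice, the synthesized model's mean perplexity
    # on the same unlabeled queries.
    return 1.0 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2

def objective(coeffs):
    alpha, beta = coeffs
    # Scalarized surrogate for the paper's Pareto-frontier search:
    # maximize consistency while penalizing high perplexity.
    return -consistency(alpha, beta) + 0.1 * math.log(perplexity(alpha, beta))

result = gp_minimize(objective, dimensions=[(0.0, 2.0), (0.0, 2.0)], n_calls=30)
alpha_star, beta_star = result.x  # coefficients used for test-time synthesis
```

In a real run, each evaluation of the objective would re-synthesize the weights with the candidate (alpha, beta) and score the resulting model over the unlabeled query set, which is why only a small number of optimization calls is needed.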