S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models

arXiv cs.CL / 4/3/2026


Key Points

  • The arXiv paper introduces “S0 tuning,” a parameter-efficient fine-tuning (PEFT) method that optimizes a single initial state matrix per recurrent layer while freezing all original model weights, adding zero inference overhead.
  • Using only about 48 execution-verified HumanEval training solutions, S0 tuning outperforms LoRA by +10.8 percentage points (p < 0.001) on HumanEval.
  • On Qwen3.5-4B (a GatedDeltaNet hybrid), S0 tuning improves greedy pass@1 by +23.6 ± 1.7 pp over 10 seeds. On FalconH1-7B (a Mamba-2 hybrid), S0 reaches 71.8% ± 1.3 versus 71.4% ± 2.4 for LoRA over 3 seeds, which is statistically indistinguishable at that sample size.
  • The method shows statistically significant cross-domain transfer on MATH-500 (+4.8 pp) and GSM8K (+2.8 pp) but none on Spider text-to-SQL, which the authors describe as consistent with a trajectory-steering mechanism.
  • In a control experiment, prefix-tuning on a pure Transformer (Qwen2.5-3B) degrades performance (-13.9 pp) under all nine configurations tested. On Qwen3.5, a per-step state-offset variant reaches +27.1 pp, above both S0 and LoRA, but adds per-step inference cost.
  • The tuned state is a ~48 MB file, and task switching requires no weight merging or model reload.

Abstract

Using roughly 48 execution-verified HumanEval training solutions, tuning a single initial state matrix per recurrent layer, with zero inference overhead, outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval. The method, which we call S0 tuning, optimizes one state matrix per recurrent layer while freezing all model weights. On Qwen3.5-4B (GatedDeltaNet hybrid), S0 tuning improves greedy pass@1 by +23.6 +/- 1.7 pp (10 seeds). On FalconH1-7B (Mamba-2 hybrid), S0 reaches 71.8% +/- 1.3 and LoRA reaches 71.4% +/- 2.4 (3 seeds), statistically indistinguishable at this sample size while requiring no weight merging. Cross-domain transfer is significant on MATH-500 (+4.8 pp, p = 0.00002, 8 seeds) and GSM8K (+2.8 pp, p = 0.0003, 10 seeds); a text-to-SQL benchmark (Spider) shows no transfer, consistent with the trajectory-steering mechanism. A prefix-tuning control on a pure Transformer (Qwen2.5-3B) degrades performance by -13.9 pp under all nine configurations tested. On Qwen3.5, a per-step state-offset variant reaches +27.1 pp, above both S0 and LoRA but with per-step inference cost. Taken together, the results show that recurrent state initialization is a strong zero-inference-overhead PEFT surface for hybrid language models when verified supervision is scarce. The tuned state is a ~48 MB file; task switching requires no weight merging or model reload. Code and library: https://github.com/jackyoung27/s0-tuning.
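The core idea in the abstract, training only the initial recurrent state while every weight stays frozen, can be illustrated with a toy example. The sketch below is not the paper's implementation or the API of the linked library: it assumes a scalar linear recurrence in place of a GatedDeltaNet or Mamba-2 layer, a squared-error loss in place of the language-modeling objective, and hypothetical names (`run`, `tune_h0`, `a`, `b`, `c`). It only shows why the tuned state adds no inference cost: the forward pass is identical, and only its starting point changes.

```python
# Toy sketch (assumption): a scalar linear recurrence stands in for a
# recurrent layer. All names are hypothetical, not from the S0 library.

def run(h0, xs, a, b, c):
    """Forward pass: h_t = a*h_{t-1} + b*x_t, output y_t = c*h_t."""
    h, ys = h0, []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys

def tune_h0(xs, targets, a, b, c, lr=0.1, steps=200):
    """Gradient descent on the initial state only; a, b, c stay frozen."""
    h0 = 0.0  # default initial state before tuning
    for _ in range(steps):
        ys = run(h0, xs, a, b, c)
        # Unrolling the recurrence gives dy_t/dh0 = c * a**t for t = 1..T,
        # so the gradient of sum_t (y_t - target_t)**2 w.r.t. h0 is:
        grad = sum(2.0 * (y - t) * c * a ** (i + 1)
                   for i, (y, t) in enumerate(zip(ys, targets)))
        h0 -= lr * grad
    return h0

# Frozen "model weights" and a short input sequence.
a, b, c = 0.9, 0.5, 1.0
xs = [1.0, 0.5, -0.3, 0.8, 0.0]

# Targets produced by the same frozen weights from a different start state.
targets = run(2.0, xs, a, b, c)

tuned = tune_h0(xs, targets, a, b, c)
print("tuned initial state:", tuned)

# Task switching in this toy is a dictionary lookup: the weights are shared
# and never merged or reloaded, only the start state differs per task.
task_states = {"default": 0.0, "tuned_task": tuned}
outputs = run(task_states["tuned_task"], xs, a, b, c)
```

In this toy the loss is quadratic in the initial state, so plain gradient descent suffices. A real hybrid model would need automatic differentiation through the recurrent layers and one state matrix per layer rather than a single scalar.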