Learning to Choose or Choosing to Learn: Best-of-N vs. Supervised Fine-Tuning for Bit String Generation

arXiv stat.ML · March 31, 2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper uses the bit string generation problem to theoretically compare two LLM adaptation approaches: supervised fine-tuning (SFT) and Best-of-N (BoN) response selection via a learned reward model.
  • Under a realizability assumption, supervised fine-tuning is shown to outperform BoN, owing to a more favorable dependence on response length in its rate of convergence.
  • If the realizability assumption fails, the outcome depends on the specific failure mode: BoN may achieve a better rate of convergence either in the number of candidates (N) or in its dependence on the response length.
  • Overall, the study frames when each strategy is likely to be preferable, linking performance differences to assumptions about how well the training objective matches the true task structure.

Abstract

Using the bit string generation problem as a case study, we theoretically compare two standard methods for adapting large language models to new tasks. The first, referred to as supervised fine-tuning, involves training a new next-token predictor on good generations. The second method, Best-of-N (BoN), trains a reward model to select good responses from a collection generated by an unaltered base model. If the learning setting is realizable, we find that supervised fine-tuning outperforms BoN through a better dependence on the response length in its rate of convergence. If realizability fails, then depending on the failure mode, BoN can enjoy either a better rate of convergence in N or a rate of convergence with better dependence on the response length.
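To make the Best-of-N procedure concrete, here is a minimal, purely illustrative sketch in the paper's bit string setting. All names are hypothetical: the "base model" is a stand-in that emits uniformly random bit strings, and the "reward model" is a toy scoring function (fraction of 1s) standing in for a learned reward; neither reflects the paper's actual constructions.

```python
import random

def base_model(length: int) -> str:
    # Toy stand-in for an unaltered base model: emits a uniformly
    # random bit string of the given length.
    return "".join(random.choice("01") for _ in range(length))

def reward_model(s: str) -> float:
    # Hypothetical stand-in for a learned reward model: scores a
    # candidate by its fraction of 1s (an arbitrary "goodness" proxy).
    return s.count("1") / len(s)

def best_of_n(n: int, length: int) -> str:
    # Best-of-N: sample N candidates from the base model and return
    # the one the reward model ranks highest.
    candidates = [base_model(length) for _ in range(n)]
    return max(candidates, key=reward_model)

if __name__ == "__main__":
    random.seed(0)
    print(best_of_n(n=16, length=8))
```

Supervised fine-tuning would instead retrain the generator itself on good strings; BoN leaves the generator untouched and spends its budget on N draws plus a selection step, which is where the N-dependence in the rates above enters.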