Test-time RL alignment exposes task familiarity artifacts in LLM benchmarks

arXiv cs.LG / 3/16/2026

📰 News · Models & Research

Key Points

  • The paper proposes a two-stage test-time RL alignment method that controls for task familiarity in LLM benchmarks without requiring task-specific training data.
  • Stage 1 uses RL with a single sample to align the model to the task format, and Stage 2 uses test-time RL with a majority-voting reward to align the model with the benchmark distribution.
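  • The approach aligns models about as well as supervised fine-tuning (SFT)-based train-before-test, without needing a task-specific training set. On a domain-specific benchmark that has no training data, base models perform substantially better once aligned, and on reasoning tasks the gap between fine-tuned models and their base models largely disappears.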
  • The findings suggest many reported gains from RL or SFT may reflect task familiarity rather than true reasoning capability, prompting a rethink of benchmarking practices.

Abstract

Direct evaluation of LLMs on benchmarks can be misleading because comparatively strong performance may reflect task familiarity rather than capability. The train-before-test approach controls for task familiarity by giving each model task-relevant training before evaluation, originally through supervised fine-tuning (SFT). However, suitable training data is often hard to come by, and evaluation results vary with the data chosen. In this paper, we propose a two-stage test-time reinforcement learning (RL) alignment method for train-before-test. First, RL with a single sample provides an initial alignment of the model to the task format; second, test-time RL with a majority-voting reward aligns the model to the benchmark distribution. Our test-time RL alignment method aligns models about as well as SFT-based train-before-test, but without requiring a task-specific training set. On a domain-specific benchmark without training data, we show that direct evaluation underestimates base models, which perform substantially better once aligned, yielding a more faithful evaluation of their capabilities. Moreover, for reasoning tasks, the performance gap between fine-tuned models and their base models largely disappears after alignment, suggesting that many gains from RL with verifiable rewards (RLVR) or SFT reported in the literature do not reflect a difference in reasoning capability, but are rather artifacts of task familiarity.
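
To make the majority-voting reward in the second stage more concrete, here is a minimal Python sketch of how such a reward could be computed for one benchmark prompt. The abstract does not spell out the exact formulation, so several details are assumptions: the function name `majority_vote_rewards` is hypothetical, answers are compared by exact string match, and the reward is binary (1.0 for agreeing with the majority, 0.0 otherwise).

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Score sampled answers for ONE prompt against their own majority.

    Assumptions (not from the paper): exact-match comparison of final
    answers and a binary 0/1 reward.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")

    counts = Counter(answers)
    # The most frequent answer serves as a pseudo-label, so no ground-truth
    # label is needed. Ties go to the answer that was sampled first.
    pseudo_label, _ = counts.most_common(1)[0]

    # Each sample is rewarded for agreeing with the pseudo-label.
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


# Five hypothetical final answers sampled from the model for one prompt.
samples = ["42", "41", "42", "7", "42"]
label, rewards = majority_vote_rewards(samples)
print(label, rewards)  # 42 [1.0, 0.0, 1.0, 0.0, 1.0]
```

In an RL loop, these per-sample rewards would take the place of a ground-truth reward signal, which is what lets the alignment run on the benchmark's own prompts without labeled training data.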