Test-time RL alignment exposes task familiarity artifacts in LLM benchmarks
arXiv cs.LG / 3/16/2026
📰 News / Models & Research
Key Points
- The paper proposes a two-stage test-time RL alignment procedure that removes task-familiarity artifacts from LLM benchmarks and requires no task-specific training data.
- Stage 1 uses RL with a single sample to align the model to the task format, and Stage 2 uses test-time RL with a majority-voting reward to align the model with the benchmark distribution.
- Without any training data, the approach matches supervised fine-tuning (SFT)-based train-before-test on a domain-specific benchmark, and it narrows the gap between base and fine-tuned models on reasoning tasks.
- The findings suggest many reported gains from RL or SFT may reflect task familiarity rather than true reasoning capability, prompting a rethink of benchmarking practices.
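The Stage 2 idea above — rewarding samples that agree with the majority answer so the model self-aligns to the benchmark distribution at test time — can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name and the 0/1 reward shaping are assumptions.

```python
from collections import Counter

def majority_vote_reward(answers):
    """Reward 1.0 for samples matching the majority answer, else 0.0.

    `answers` holds the final answers extracted from N completions
    sampled for the same benchmark question (hypothetical helper;
    the paper's exact reward shaping is not specified here).
    """
    if not answers:
        return []
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

# Three samples for one question: two agree on "4", one says "5".
rewards = majority_vote_reward(["4", "4", "5"])
print(rewards)  # [1.0, 1.0, 0.0]
```

Because the reward comes from agreement among the model's own samples rather than from ground-truth labels, this step needs no task-specific training data, which is the point of the test-time setup.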
Related Articles

Report: Observations of "Self-Referential Recursion" and "Stateful Emulation" in LLMs
note

Dialogue with Master Zhuge Liang (Kongming) (a ChatGPT roleplay), Part 45: "Galactic Civilization and the Dark Matter Engine"
note

GPT-5.4 mini/nano arrives! Small, high-performance models that are twice as fast and available on the free plan
note
Why a Perfect-Memory AI Agent Without Persona Drift is Architecturally Impossible
Dev.to
Learning to Reason with Curriculum I: Provable Benefits of Autocurriculum
arXiv cs.LG