Reasoning Through Chess: How Reasoning Evolves from Data Through Fine-Tuning and Reinforcement Learning

arXiv cs.LG / 4/8/2026


Key Points

  • The paper studies how LLM reasoning in chess improves as training progresses from supervised fine-tuning (SFT) to reinforcement learning (RL) using theoretically inspired datasets.
  • It finds that SFT on directly predicting the best move makes the subsequent RL effective and yields the strongest downstream performance, but the RL step can elicit unfaithful reasoning, i.e., reasoning inconsistent with the chosen move.
  • Training on multi-move trajectories achieves similar downstream chess performance while improving “faithful reasoning” and making RL training more stable.
  • The authors report that RL shifts the distribution of move quality positively and reduces hallucination rates, and they identify SFT checkpoint metrics (evaluation, hallucinations, reasoning quality) that predict post-RL performance.
  • They release checkpoints, final models, training data, evaluations, and code, claiming a 7B-parameter model that surpasses leading open-source reasoning models in chess.
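To make the "faithful reasoning" notion concrete: a trace is faithful when the move the reasoning argues for is the move actually played. The paper does not specify its faithfulness check; the sketch below is a toy heuristic (hypothetical `extract_moves` / `is_faithful` helpers) that treats the last SAN-notation move mentioned in a trace as its conclusion.

```python
import re

# Simplified SAN pattern: castling, piece moves, pawn moves, captures, promotions.
SAN_PATTERN = r"\b(?:O-O(?:-O)?|[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](?:=[QRBN])?[+#]?)\b"

def extract_moves(reasoning: str) -> list[str]:
    """Find chess moves written in SAN inside a free-text reasoning trace."""
    return re.findall(SAN_PATTERN, reasoning)

def is_faithful(reasoning: str, chosen_move: str) -> bool:
    """Toy faithfulness check: the last move the trace discusses must
    match the move the model actually selected."""
    moves = extract_moves(reasoning)
    return bool(moves) and moves[-1] == chosen_move
```

A trace like "I considered Nf3 but O-O is safer, so I play O-O." is faithful to the answer `O-O`, while a trace that concludes with `Nf3` but answers `Qh5` is not. A real check would need position-aware move parsing (e.g. with a chess library) rather than a regex.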

Abstract

How can you get a language model to reason in a task it natively struggles with? We study how reasoning evolves in a language model -- from supervised fine-tuning (SFT) to reinforcement learning (RL) -- by analyzing how a set of theoretically-inspired datasets impacts language model performance in chess. We find that fine-tuning a model to directly predict the best move leads to effective RL and the strongest downstream performance -- however, the RL step elicits unfaithful reasoning (reasoning inconsistent with the chosen move). Alternatively, training on multi-move trajectories yields comparable downstream performance with faithful reasoning and more stable RL. We show that RL induces a substantial positive shift in the distribution of move quality and reduces hallucination rates as a side effect. Finally, we find several SFT-checkpoint metrics -- metrics spanning evaluation performance, hallucination rates, and reasoning quality -- to be predictive of post-RL model performance. We release checkpoints and final models as well as training data, evaluations, and code which allowed us to surpass leading open-source reasoning models in chess with a 7B-parameter model.
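The "positive shift in the distribution of move quality" can be quantified in several ways; the paper does not give its exact metric. A common proxy is centipawn loss per move (0 = the engine-best move), so a minimal illustrative comparison, assuming pre- and post-RL centipawn-loss samples, might look like:

```python
from statistics import mean

def quality_shift(pre_cploss, post_cploss, blunder_cp=100):
    """Compare two samples of per-move centipawn loss (0 = engine-best move).

    Returns (mean_improvement, blunder_rate_reduction): how much the average
    loss dropped after RL, and how much the fraction of moves losing at least
    `blunder_cp` centipawns dropped.
    """
    blunder_rate = lambda xs: sum(x >= blunder_cp for x in xs) / len(xs)
    mean_improvement = mean(pre_cploss) - mean(post_cploss)
    return mean_improvement, blunder_rate(pre_cploss) - blunder_rate(post_cploss)
```

On toy data, `quality_shift([0, 50, 200, 300], [0, 0, 50, 150])` reports a mean improvement of 87.5 centipawns and a 25-point drop in blunder rate, i.e. the whole distribution moved toward better moves, which is the shape of shift the paper attributes to RL.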