Grounded Chess Reasoning in Language Models via Master Distillation

arXiv cs.AI / 2026/3/24

Key Points

The paper proposes “Master Distillation,” a framework that distills expert system reasoning into language-model chain-of-thought explanations, aiming to make reasoning both grounded and faithful in data-scarce domains.
Rather than training only on final outputs, it transfers the full step-by-step reasoning process from an expert system, turning typically opaque computations into transparent explanations.
Demonstrated in chess, the authors report that their 4B-parameter model “C1” rises from a near-zero baseline to 48.1% accuracy, outperforming all open-source models and most frontier proprietary systems.
C1 is reported to surpass its distillation teacher and to produce solutions in two orders of magnitude fewer tokens than baselines, while also providing explainable strategic reasoning rather than just best-move prediction.
The training pipeline combines supervised fine-tuning, reinforcement learning, and theme-balanced data sampling to broaden tactical coverage, positioning the method as a general recipe for injecting expert knowledge into smaller models.

Abstract

Language models often lack grounded reasoning capabilities in specialized domains where training data is scarce but bespoke systems excel. We introduce a general framework for distilling expert system reasoning into natural language chain-of-thought explanations, enabling compact models to acquire domain expertise and the ability to generate faithful, grounded explanations. Rather than distilling only final outputs, we capture the full reasoning process, transforming opaque expert computations into transparent, step-by-step explanations. We demonstrate this approach in chess, a canonical reasoning domain where language models continue to underperform. Our 4B parameter model, C1, advances from a near-zero baseline to 48.1% accuracy, outperforming all open-source models and most frontier proprietary systems. Notably, C1 surpasses its distillation teacher and generates solutions in two orders of magnitude fewer tokens than baselines. Unlike prior neural chess approaches that predict only best moves, C1 generates explainable solutions revealing strategic reasoning. Our pipeline combines supervised fine-tuning and reinforcement learning with theme-balanced data sampling for comprehensive tactical coverage. Master Distillation demonstrates how to inject expert-level knowledge into compact models for under-optimized domains, offering a recipe for unlocking RLVR where LLMs lack sufficient base capabilities.
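
The abstract names theme-balanced data sampling but does not say how it is implemented. The following is a minimal sketch of one plausible reading, assuming each training puzzle carries a single tactical theme label: pick a theme uniformly at random, then pick a puzzle uniformly within that theme, so rare themes are not drowned out by common ones. The function name `theme_balanced_sample` and the record keys `"fen"` and `"theme"` are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def theme_balanced_sample(puzzles, n, seed=0):
    """Draw n puzzles (with replacement) so every theme is equally likely.

    `puzzles` is a list of dicts with the hypothetical keys "fen" and "theme".
    This illustrates the general idea only; the paper's actual sampler
    may differ.
    """
    rng = random.Random(seed)

    # Group puzzles by their theme label.
    by_theme = defaultdict(list)
    for puzzle in puzzles:
        by_theme[puzzle["theme"]].append(puzzle)

    themes = sorted(by_theme)  # sorted for reproducibility across runs
    sample = []
    for _ in range(n):
        theme = rng.choice(themes)                  # uniform over themes
        sample.append(rng.choice(by_theme[theme]))  # uniform within the theme
    return sample
```

With a dataset of 90 "fork" puzzles and 10 "pin" puzzles, uniform sampling over puzzles would return pins about 10% of the time, whereas this sketch returns them about half the time.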