Tracking vs. Deciding: The Dual-Capability Bottleneck in Searchless Chess Transformers

arXiv cs.AI / 4/1/2026


Key Points

  • The paper argues that “searchless” chess transformers trained only on move sequences must learn two distinct but conflicting capabilities: state tracking from move history and decision quality for choosing good moves.
  • It formalizes this as a dual-capability bottleneck (performance limited by the weaker of tracking or decision learning), explaining why low-rated games help tracking diversity while high-rated games provide better decision signals, and removing low-rated data harms results.
  • The authors scale the model from 28M to 120M parameters to improve tracking, then use Elo-weighted training to boost decision quality while preserving diversity, finding that the two interventions combine superadditively.
  • Their experiments show that scaling improves tracking, weighting improves decisions, and linear weighting works best; overly aggressive weighting can damage tracking even when validation loss decreases.
  • The 120M-parameter model (no search) reaches Lichess Bullet ~2570 and achieves 55.2% Top-1 accuracy on human move prediction, while sequence-based input enables history-dependent behavior that position-only methods lack.
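The paper does not spell out its training objective, so the following is only a minimal sketch of what linear Elo weighting could look like: each game's loss contribution is scaled linearly with the players' rating, with a floor so that low-rated games are down-weighted rather than discarded (preserving the diversity needed for tracking). The bounds `elo_min`, `elo_max` and the floor `w_min` are illustrative assumptions, not values from the paper.

```python
def linear_elo_weight(elo, elo_min=800.0, elo_max=3000.0, w_min=0.1):
    """Hypothetical linear Elo weight in [w_min, 1.0].

    Elo is clipped to [elo_min, elo_max] and mapped linearly, so
    high-rated games carry a stronger decision signal while low-rated
    games still contribute (weight floor w_min, never zero).
    """
    t = (min(max(elo, elo_min), elo_max) - elo_min) / (elo_max - elo_min)
    return w_min + (1.0 - w_min) * t


def weighted_nll(per_game_nll, elos):
    """Elo-weighted mean negative log-likelihood over a batch of games."""
    weights = [linear_elo_weight(e) for e in elos]
    return sum(w * l for w, l in zip(weights, per_game_nll)) / sum(weights)
```

Under this sketch, the paper's observation that "overly aggressive weighting harms tracking" would correspond to raising the weighting's steepness (or dropping the floor toward zero), which starves the model of the low-rated games that supply state-tracking diversity.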

Abstract

A human-like chess engine should mimic the style, errors, and consistency of a strong human player rather than maximize playing strength. We show that training from move sequences alone forces a model to learn two capabilities: state tracking, which reconstructs the board from move history, and decision quality, which selects good moves from that reconstructed state. These impose contradictory data requirements: low-rated games provide the diversity needed for tracking, while high-rated games provide the quality signal for decision learning. Removing low-rated data degrades performance. We formalize this tension as a dual-capability bottleneck, P <= min(T, Q), where overall performance is limited by the weaker capability. Guided by this view, we scale the model from 28M to 120M parameters to improve tracking, then introduce Elo-weighted training to improve decisions while preserving diversity. A 2 x 2 factorial ablation shows that scaling improves tracking, weighting improves decisions, and their combination is superadditive. Linear weighting works best, while overly aggressive weighting harms tracking despite lower validation loss. We also introduce a coverage-decay formula, t* = log(N / k_crit) / log(b), as a reliability horizon for intra-game degeneration risk. Our final 120M-parameter model, without search, reached Lichess bullet 2570 over 253 rated games. On human move prediction it achieves 55.2% Top-1 accuracy, exceeding Maia-2 rapid and Maia-2 blitz. Unlike position-based methods, sequence input naturally encodes full game history, enabling history-dependent decisions that single-position models cannot exhibit.
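The coverage-decay formula t* = log(N / k_crit) / log(b) can be evaluated directly. Reading it as stated: if N training games fan out over a game tree with effective branching factor b per ply, per-state coverage falls below a critical count k_crit after roughly t* plies. The concrete numbers below (N, k_crit, b) are illustrative assumptions, not figures from the paper.

```python
import math


def reliability_horizon(n_games, k_crit, branching):
    """Coverage-decay horizon t* = log(N / k_crit) / log(b).

    Ply depth at which expected per-state coverage n_games / branching**t
    drops below the critical count k_crit.
    """
    return math.log(n_games / k_crit) / math.log(branching)


# Illustrative example (assumed values): 1M games, critical coverage 10,
# effective branching factor 30 per ply.
t_star = reliability_horizon(1_000_000, 10, 30)
```

Note the intended monotonicity: more data (larger N) or a narrower tree (smaller b) pushes the horizon out, while demanding more coverage per state (larger k_crit) pulls it in.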