LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning

arXiv cs.LG / April 16, 2026


Key Points

  • The paper introduces LongCoT, a scalable benchmark with 2,500 expert-designed problems across chemistry, mathematics, computer science, chess, and logic to measure long-horizon chain-of-thought reasoning.
  • Each problem has a verifiable answer and requires solving a large graph of interdependent steps spanning tens to hundreds of thousands of reasoning tokens, isolating long-horizon planning and CoT management rather than local step difficulty.
  • The benchmark is designed so that individual sub-steps remain tractable for frontier models, meaning observed errors more directly reflect limitations in sustaining correct reasoning over long horizons.
  • At release, leading models score under 10% accuracy on LongCoT (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%), indicating a substantial gap in current long-horizon reasoning capabilities.
  • LongCoT is positioned as a rigorous yardstick for tracking and comparing how well frontier language models reason reliably over extended multi-step processes.

Abstract

As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this ability is planning and managing a long, complex chain-of-thought (CoT). We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic to isolate and directly measure the long-horizon CoT reasoning capabilities of frontier models. Problems consist of a short input with a verifiable answer; solving them requires navigating a graph of interdependent steps that span tens to hundreds of thousands of reasoning tokens. Each local step is individually tractable for frontier models, so failures reflect long-horizon reasoning limitations. At release, the best models achieve <10% accuracy (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%) on LongCoT, revealing a substantial gap in current capabilities. Overall, LongCoT provides a rigorous measure of long-horizon reasoning, tracking the ability of frontier models to reason reliably over extended periods.
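The abstract describes each problem as a graph of interdependent steps plus a short, exactly verifiable answer. As a rough illustration, the sketch below shows one plausible way such a problem could be represented and graded: a dependency graph checked for a valid solving order (Kahn's topological sort), and accuracy computed by exact match against the verifiable answer. All names here (`Problem`, `solvable_order`, `accuracy`) are illustrative assumptions, not the paper's actual data format or harness.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Problem:
    """Hypothetical LongCoT-style problem: short input, verifiable answer,
    and a graph of interdependent reasoning steps (step_id -> prerequisites)."""
    prompt: str
    answer: str
    deps: dict


def solvable_order(problem: Problem) -> list:
    """Order steps so every prerequisite precedes its dependents
    (Kahn's algorithm); raises ValueError on cyclic dependencies."""
    indeg = {s: len(reqs) for s, reqs in problem.deps.items()}
    dependents = {s: [] for s in problem.deps}
    for s, reqs in problem.deps.items():
        for r in reqs:
            dependents[r].append(s)
    queue = deque(s for s, d in indeg.items() if d == 0)
    order = []
    while queue:
        s = queue.popleft()
        order.append(s)
        for t in dependents[s]:
            indeg[t] -= 1
            if indeg[t] == 0:
                queue.append(t)
    if len(order) != len(problem.deps):
        raise ValueError("cyclic step dependencies")
    return order


def accuracy(problems, model_answer) -> float:
    """Exact-match grading: fraction of problems where the model's final
    answer equals the problem's verifiable answer."""
    correct = sum(model_answer(p.prompt) == p.answer for p in problems)
    return correct / len(problems)
```

Because each answer is short and exactly checkable, grading reduces to string comparison, which is what makes a benchmark of this kind scalable; the hard part lives entirely in navigating the step graph, not in verification.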