LLM Reasoning Is Latent, Not the Chain of Thought

arXiv cs.AI / 4/20/2026


Key Points

  • The paper argues that LLM “reasoning” should be studied as the formation of latent-state trajectories rather than as a faithful, observable chain-of-thought (CoT) on the surface.
  • It explains that claims about faithfulness, interpretability, reasoning benchmarks, and inference-time intervention all depend on what researchers take the primary object of reasoning to be.
  • The authors disentangle three frequently confounded factors and formalize competing hypotheses: reasoning via latent trajectories (H1), reasoning via explicit surface CoT (H2), or reasoning gains driven mainly by generic serial compute (H0).
  • By reorganizing prior empirical, mechanistic, and survey evidence and adding compute-audited examples that separate surface traces from latent interventions and matched budget increases, the paper finds that current evidence most strongly supports H1 as a default working hypothesis.
  • The paper recommends that the field adopt latent-state dynamics as the default object of study and evaluate reasoning using experimental designs that explicitly disentangle surface traces, latent states, and serial compute.

Abstract

This position paper argues that large language model (LLM) reasoning should be studied as latent-state trajectory formation rather than as faithful surface chain-of-thought (CoT). This matters because claims about faithfulness, interpretability, reasoning benchmarks, and inference-time intervention all depend on what the field takes the primary object of reasoning to be. We ask what that object should be once three often-confounded factors are separated and formalize three competing hypotheses: H1, reasoning is primarily mediated by latent-state trajectories; H2, reasoning is primarily mediated by explicit surface CoT; and H0, most apparent reasoning gains are better explained by generic serial compute than by any privileged representational object. Reorganizing recent empirical, mechanistic, and survey work under this framework, and adding compute-audited worked exemplars that factorize surface traces, latent interventions, and matched budget expansions, we find that current evidence most strongly supports H1 as a default working hypothesis rather than as a task-independent verdict. We therefore make two recommendations: the field should treat latent-state dynamics as the default object of study for LLM reasoning, and it should evaluate reasoning with designs that explicitly disentangle surface traces, latent states, and serial compute.
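To make the final recommendation concrete, below is a minimal illustrative sketch (not taken from the paper) of a 2x2x2 evaluation grid that crosses the three factors the authors say must be disentangled: whether a surface CoT trace is produced, whether a latent-state intervention is applied, and whether extra serial compute is granted. The `run_condition` callable, the factor names, and the crude main-effect attribution are hypothetical placeholders for whatever task suite, model interface, and statistics an experimenter would actually use.

```python
# Illustrative sketch only: a factorial evaluation grid over the three factors
# (surface CoT, latent intervention, serial compute). All names are hypothetical.
from itertools import product
from typing import Callable, Dict, Tuple

# (show_surface_cot, apply_latent_intervention, add_serial_compute)
Condition = Tuple[bool, bool, bool]


def evaluate_grid(run_condition: Callable[[Condition], float]) -> Dict[Condition, float]:
    """Run every cell of the 2x2x2 grid and return accuracy per condition.

    `run_condition` is assumed to evaluate a fixed task set under one condition
    while keeping the total compute budget audited and matched across cells.
    """
    return {cond: run_condition(cond) for cond in product([False, True], repeat=3)}


def attribute_gains(grid: Dict[Condition, float]) -> Dict[str, float]:
    """Crude main-effect estimates: mean accuracy change from toggling one factor,
    marginalizing over the other two. Large effects map loosely onto H2 (surface
    CoT), H1 (latent intervention), and H0 (generic serial compute)."""
    factor_names = ["surface_cot (H2)", "latent_intervention (H1)", "serial_compute (H0)"]
    effects = {}
    for i, name in enumerate(factor_names):
        on = [acc for cond, acc in grid.items() if cond[i]]
        off = [acc for cond, acc in grid.items() if not cond[i]]
        effects[name] = sum(on) / len(on) - sum(off) / len(off)
    return effects


# Usage with a dummy evaluator (placeholder scores, not real results):
#   grid = evaluate_grid(lambda cond: 0.5 + 0.1 * cond[1])
#   print(attribute_gains(grid))
```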