Adaptive Thinking: Large Language Models Know When to Think in Latent Space

Apple Machine Learning Journal / 4/29/2026


Key Points

  • The paper studies how to allocate “test-time thinking” (extra inference compute) in large language models to improve performance without wasting compute.
  • It notes that although increasing the thinking budget can boost results, the optimal relationship between model capability, query complexity, and budget allocation is still unclear.
  • The approach uses self-consistency (agreement across multiple reasoning paths) as a signal for whether additional latent-space thinking is actually necessary; a minimal sketch of this idea follows the list.
  • The method proceeds by identifying when latent-space thinking should be triggered and then adapting the amount/type of reasoning accordingly for compute-optimal inference.
  • Overall, it targets more efficient LLM inference by deciding “when to think” rather than always using a larger reasoning budget.
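
The self-consistency signal in the third bullet can be made concrete. Below is a minimal sketch, not the paper's implementation: it assumes a hypothetical `generate(prompt, budget, seed)` callable that returns a final answer string, and scores agreement as the fraction of sampled answers that match the majority answer.

```python
from collections import Counter

def self_consistency(prompt: str, generate, k: int = 8, budget: int = 0) -> tuple[str, float]:
    """Sample k answers at a given thinking budget and measure agreement.

    Returns the majority answer and the fraction of samples that agree
    with it. High agreement suggests extra thinking is unnecessary;
    low agreement suggests the query may warrant a larger budget.
    (k, the budget units, and `generate` are illustrative assumptions.)
    """
    answers = [generate(prompt, budget=budget, seed=s) for s in range(k)]
    majority, count = Counter(answers).most_common(1)[0]
    return majority, count / k
```
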
Recent advances in large language model (LLM) test-time computing have introduced the capability to perform intermediate chain-of-thought (CoT) reasoning (thinking) before generating answers. While increasing the thinking budget yields smooth performance improvements at inference time, the relationship between LLM capability, query complexity, and optimal budget allocation remains poorly understood for achieving compute-optimal inference. To address this challenge, we utilize self-consistency, the agreement among multiple reasoning paths, as a proxy for thinking necessity. We first identify…
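
A gating loop built on that agreement score might look like the following sketch. The escalating budget schedule and the 0.75 threshold are assumptions chosen for illustration; the paper's actual trigger and budget-allocation rules are described in the full text.

```python
def adaptive_think(prompt: str, generate, budgets=(0, 256, 1024), threshold: float = 0.75):
    """Escalate the thinking budget only while sampled answers disagree.

    Starts with no explicit thinking; if self-consistency falls below
    the threshold, retry with a larger latent-thinking budget.
    (Budgets and threshold are illustrative, not from the paper.)
    """
    answer, score = None, 0.0
    for budget in budgets:
        answer, score = self_consistency(prompt, generate, budget=budget)
        if score >= threshold:  # answers agree: no need to think harder
            break
    return answer
```

Used this way, easy queries resolve at the smallest budget while only low-agreement queries pay for extended latent-space thinking, which is the compute-optimal behavior the paper targets.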

Continue reading this article on the original site.
