Dual-Track CoT: Budget-Aware Stepwise Guidance for Small LMs

arXiv cs.CL / April 29, 2026

Key Points

  • The paper investigates how small language models (around 7–8B parameters) can perform multi-step reasoning under strict compute and token budgets using chain-of-thought prompting.
  • It argues that existing test-time reasoning approaches (e.g., self-consistency, Tree-of-Thoughts, and critique-revise loops) often improve accuracy only at high token cost, and offer no fine-grained control over individual reasoning steps.
  • The proposed “Dual-Track CoT” approach targets this gap with budget-aware, stepwise guidance and controls such as rejecting redundant steps, aiming to improve reliability without increasing token usage (see the sketch after this list).
  • The work frames the contribution as both scientific (testing whether step-level process supervision and simple test-time constraints can substitute for larger model scale or heavy sampling) and practical (relevant to cost- and latency-constrained deployments).
  • The central question is whether small models can achieve reliable reasoning with the same or fewer tokens than prior methods, making it directly relevant for on-device and low-cost inference scenarios.
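
To make the proposed control concrete, here is a minimal Python sketch of the loop the Key Points describe: accept reasoning steps until a token budget is exhausted, and reject steps that nearly duplicate earlier ones. Everything in it (the `is_redundant` check, the `budgeted_cot` loop, the 0.9 similarity threshold, and word-count budgeting) is an illustrative assumption, not the paper's actual algorithm.

```python
# Illustrative sketch only: the real Dual-Track CoT procedure is not
# reproduced here. All names, thresholds, and the word-count "tokenizer"
# below are assumptions made for demonstration.
from difflib import SequenceMatcher
from typing import Iterable, List


def is_redundant(candidate: str, accepted: List[str], threshold: float = 0.9) -> bool:
    """Flag a proposed step that nearly duplicates an already-accepted one."""
    return any(SequenceMatcher(None, candidate, step).ratio() >= threshold
               for step in accepted)


def budgeted_cot(step_source: Iterable[str],
                 token_budget: int = 64,
                 max_rejects: int = 3) -> List[str]:
    """Accept reasoning steps until the budget or the reject limit is hit."""
    accepted: List[str] = []
    used = rejects = 0
    for candidate in step_source:
        cost = len(candidate.split())  # crude word count standing in for tokens
        if used + cost > token_budget:
            break  # hard budget: never emit a step that would overshoot it
        if is_redundant(candidate, accepted):
            rejects += 1  # drop the duplicate; spend no budget on it
            if rejects >= max_rejects:
                break
            continue
        accepted.append(candidate)
        used += cost
    return accepted


if __name__ == "__main__":
    # Canned proposals stand in for an SLM's step generator; note the duplicate.
    proposals = [
        "Break 24 into 20 + 4.",
        "Compute 17 * 20 = 340.",
        "Compute 17 * 20 = 340.",
        "Compute 17 * 4 = 68.",
        "Add 340 + 68 = 408.",
    ]
    for step in budgeted_cot(proposals):
        print(step)
```

In this sketch a rejected duplicate is not charged against the budget, so tokens are spent only on steps that add information; whether the paper counts discarded generations toward its budget is not specified here.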

Abstract

Large Language Models (LLMs) solve many reasoning tasks via chain-of-thought (CoT) prompting, but smaller models (roughly 7–8B parameters) still struggle with multi-step reasoning under tight compute and token budgets. Existing test-time reasoning methods such as self-consistency (sampling multiple rationales and voting), Tree-of-Thoughts (search over intermediate thoughts), and critique-revise loops improve performance, but often at high token cost and without fine-grained step-level control. This project aims to address that gap: can Small Language Models (SLMs) reason reliably using the same or fewer tokens? The question is both scientific and practical. Scientifically, it probes whether process supervision and simple test-time controls (such as token budgets and rejection of redundant steps) can substitute for model scale or large sampling counts. Practically, many deployments (on-device, low-latency, or cost-constrained settings) cannot afford huge models or dozens of sampled rationales per query. A method that improves SLM reasoning at fixed cost would therefore be directly useful.
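
For contrast with the baselines the abstract names, here is a minimal sketch of self-consistency: sample several rationales, majority-vote on the final answers, and tally a token spend that grows linearly with the sample count. The `self_consistency` helper and the `noisy_solver` stub are hypothetical stand-ins, not any library's API; the point is only the cost structure.

```python
# Baseline sketch of self-consistency as described in the abstract: sample
# multiple rationales and majority-vote the answers. `self_consistency` and
# `noisy_solver` are hypothetical stand-ins, not the paper's code.
import random
from collections import Counter
from typing import Callable, List, Tuple


def self_consistency(sample_fn: Callable[[], Tuple[str, str]],
                     n_samples: int = 8) -> Tuple[str, int]:
    """Sample n rationales, vote on the final answers, and tally token spend."""
    answers: List[str] = []
    total_tokens = 0
    for _ in range(n_samples):
        rationale, answer = sample_fn()
        total_tokens += len(rationale.split())  # word count as a token proxy
        answers.append(answer)
    winner, _ = Counter(answers).most_common(1)[0]
    return winner, total_tokens


if __name__ == "__main__":
    random.seed(0)

    def noisy_solver() -> Tuple[str, str]:
        # Stand-in for one sampled CoT: a ~40-word rationale plus an answer
        # that is correct 70% of the time.
        rationale = " ".join(["step"] * 40)
        return rationale, ("408" if random.random() < 0.7 else "398")

    answer, spent = self_consistency(noisy_solver, n_samples=8)
    print(f"majority answer: {answer}, tokens spent: ~{spent}")
```

With eight samples of roughly 40 tokens each, one vote costs on the order of 320 tokens; this multiplicative overhead is what the paper's fixed-budget framing is set against.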