On the Step Length Confounding in LLM Reasoning Data Selection

arXiv cs.CL / 4/9/2026

Key Points

  • The paper analyzes how common naturalness-based data selection for LLM reasoning datasets—ranking samples by average log probability—can mis-rank reasoning quality due to a newly identified “step length confounding” effect.
  • It shows that the selection method systematically favors samples with more tokens per reasoning step, regardless of their quality.
  • The authors trace the effect to the low-probability first token of each reasoning step: in longer steps that token carries less weight in the mean, which inflates the average log probability.
  • To mitigate this, they propose ASLEC-DROP (removing first-token probabilities from the averaging) and ASLEC-CASL (causal debiasing regression to remove first-token confounding).
  • Experiments across four LLMs and five evaluation benchmarks indicate that both methods mitigate step length confounding.
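
The dilution effect and the first-token drop described in the points above can be illustrated with a small Python sketch. It is not the authors' implementation: the function names, the toy log probabilities (-4.0 for the first token of a step, -0.5 for every other token), and the step layouts are all assumptions chosen for illustration, and the second function only approximates the idea behind ASLEC-DROP as summarized here.

```python
from typing import List, Sequence


def average_log_prob(steps: Sequence[Sequence[float]]) -> float:
    """Mean token log probability over every token in every step."""
    tokens = [lp for step in steps for lp in step]
    return sum(tokens) / len(tokens)


def average_log_prob_drop_first(steps: Sequence[Sequence[float]]) -> float:
    """Same mean, but the first token of each step is left out.

    Hypothetical helper in the spirit of ASLEC-DROP as summarized above.
    """
    tokens = [lp for step in steps for lp in step[1:]]
    if not tokens:
        raise ValueError("every step has at most one token")
    return sum(tokens) / len(tokens)


def make_sample(num_steps: int, tokens_per_step: int,
                first: float = -4.0, rest: float = -0.5) -> List[List[float]]:
    """Toy sample: each step opens with one low-probability token (assumed values)."""
    return [[first] + [rest] * (tokens_per_step - 1) for _ in range(num_steps)]


# Two samples with the same total length (40 tokens) and identical
# per-token behavior; only the step length differs.
short_steps = make_sample(num_steps=8, tokens_per_step=5)
long_steps = make_sample(num_steps=2, tokens_per_step=20)

print(average_log_prob(short_steps))             # -1.2
print(average_log_prob(long_steps))              # -0.675
print(average_log_prob_drop_first(short_steps))  # -0.5
print(average_log_prob_drop_first(long_steps))   # -0.5
```

In this toy setting the plain average ranks the long-step sample higher even though both samples are built from the same token-level values, while the variant that skips first tokens scores them equally.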

Abstract

Large reasoning models have recently demonstrated strong performance on complex tasks that require long chain-of-thought reasoning, through supervised fine-tuning on large-scale, high-quality datasets. To construct such datasets, existing pipelines generate long reasoning data from more capable Large Language Models (LLMs) and apply manual heuristics or naturalness-based selection methods to filter high-quality samples. Despite the proven effectiveness of naturalness-based data selection, which ranks data by the average log probability assigned by LLMs, our analysis shows that, when applied to LLM reasoning datasets, it systematically prefers samples with longer reasoning steps (i.e., more tokens per step) rather than higher-quality ones, a phenomenon we term step length confounding. Through quantitative analysis, we attribute this phenomenon to low-probability first tokens in reasoning steps; longer steps dilute their influence, thereby inflating the average log probabilities. To address this issue, we propose two variant methods: ASLEC-DROP, which drops first-token probabilities when computing average log probability, and ASLEC-CASL, which applies a causal debiasing regression to remove the first tokens' confounding effect. Experiments across four LLMs and five evaluation benchmarks demonstrate the effectiveness of our approach in mitigating the step length confounding problem.
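
The dilution argument in the abstract can be made concrete with a short worked equation. This is a simplified illustration, not the paper's own analysis: it assumes a single step of $\ell$ tokens whose first token has log probability $a$ and whose remaining $\ell - 1$ tokens each have log probability $b$, with $a < b$. The symbols $a$, $b$, and $\ell$ are introduced here for illustration only.

```latex
\[
\bar{s}(\ell)
  = \frac{a + (\ell - 1)\, b}{\ell}
  = b - \frac{b - a}{\ell},
  \qquad a < b .
\]
% The penalty term (b - a)/\ell shrinks as \ell grows, so the average
% rises toward b with step length even though a and b are unchanged.
% Example with assumed values a = -4 and b = -0.5:
\[
\bar{s}(5) = -0.5 - \frac{3.5}{5} = -1.2,
\qquad
\bar{s}(20) = -0.5 - \frac{3.5}{20} = -0.675 .
\]
```

Under these assumptions the score depends on $\ell$ only through the first-token term, so leaving that term out of the average gives $b$ for every step length.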