Short Chains, Deep Thoughts: Balancing Reasoning Efficiency and Intra-Segment Capability via Split-Merge Optimization

arXiv cs.CL / 5/4/2026


Key Points

  • The paper argues that long, verbose reasoning chains in Large Reasoning Models cause major latency and compute costs, and proposes addressing redundancy rather than simply limiting token length.
  • It introduces CoSMo (Consistency-Guided Split-Merge Optimization), which uses a split-merge algorithm to dynamically merge redundant reasoning segments and split where logical gaps appear to preserve coherence.
  • The authors use structure-aligned reinforcement learning with a new segment-level budget to train models to maintain efficient reasoning structures over time.
  • Experiments across multiple benchmarks and model backbones show CoSMo improves accuracy by 3.3 points while reducing segment usage by 28.7% on average versus reasoning-efficiency baselines.
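  The merge half of the split-merge idea can be sketched in miniature: walk adjacent reasoning segments and fold a segment into its predecessor when the two are judged redundant. The sketch below is purely illustrative and is not the paper's algorithm; it stands in for CoSMo's consistency signal with a simple token-overlap (Jaccard) similarity, and the `merge_redundant` name and `threshold` value are assumptions.

```python
# Illustrative sketch only: a toy "merge redundant adjacent segments" pass.
# The real CoSMo consistency signal and merge criterion are not reproduced
# here; token-set Jaccard overlap is a deliberately simple stand-in.

def overlap_similarity(a: str, b: str) -> float:
    """Jaccard overlap of whitespace tokens (hypothetical redundancy proxy)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def merge_redundant(segments, threshold=0.5):
    """Greedily merge each segment into the previous one when their
    overlap exceeds the threshold, otherwise start a new segment."""
    merged = [segments[0]]
    for seg in segments[1:]:
        if overlap_similarity(merged[-1], seg) >= threshold:
            merged[-1] = merged[-1] + " " + seg  # fold redundant step in
        else:
            merged.append(seg)
    return merged

segments = [
    "Compute 12 * 7 by splitting into 10 * 7 and 2 * 7.",
    "Splitting 12 * 7 into 10 * 7 and 2 * 7 gives 70 and 14.",
    "Adding 70 and 14 yields 84.",
]
print(merge_redundant(segments))  # first two near-duplicate steps collapse
```

A split pass would do the inverse: detect where a segment jumps over a needed inference and insert a bridging segment there, which is where the paper's consistency guidance matters most.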

Abstract

While Large Reasoning Models (LRMs) have demonstrated impressive capabilities in solving complex tasks through the generation of long reasoning chains, this reliance on verbose generation results in significant latency and computational overhead. To address these challenges, we propose **CoSMo** (**Co**nsistency-Guided **S**plit-**M**erge **O**ptimization), a framework designed to eliminate structural redundancy rather than indiscriminately restricting token volume. Specifically, CoSMo utilizes a split-merge algorithm that dynamically refines reasoning chains by merging redundant segments and splitting logical gaps to ensure coherence. We then employ structure-aligned reinforcement learning with a novel segment-level budget to supervise the model in maintaining efficient reasoning structures throughout training. Extensive experiments across multiple benchmarks and backbones demonstrate that CoSMo achieves superior performance, improving accuracy by **3.3** points while reducing segment usage by **28.7%** on average compared to reasoning efficiency baselines.
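
A segment-level budget of the kind the abstract describes can be made concrete with a toy reward shape: reward correctness, then penalize each segment beyond the budget. This is a minimal sketch under assumptions; the paper's actual reward design, penalty form, and constants are not specified here, and `budget_reward` is a hypothetical name.

```python
# Toy segment-budget reward (illustrative, not the paper's formulation):
# +1 for a correct answer, minus a linear penalty for every reasoning
# segment that exceeds the allotted segment budget.

def budget_reward(correct: bool, n_segments: int, budget: int,
                  penalty: float = 0.1) -> float:
    reward = 1.0 if correct else 0.0
    overshoot = max(0, n_segments - budget)  # segments past the budget
    return reward - penalty * overshoot

# A correct answer within budget keeps the full reward; going over
# budget trades reward for extra segments, steering the policy toward
# shorter chains without hard-capping token length.
```

The design choice worth noting is that the penalty is counted in segments rather than tokens, which matches the paper's framing of redundancy as structural rather than purely a length problem.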