Reasoning Shift: How Context Silently Shortens LLM Reasoning

arXiv cs.LG / 4/2/2026


Key Points

  • The paper evaluates multiple reasoning-focused LLMs across three setups that vary the amount and nature of surrounding context, including long irrelevant context and multi-turn/task-subtask framing.
  • It finds that LLMs can silently “compress” their reasoning traces for the same underlying problem—producing up to 50% shorter traces when context is present versus when the problem is isolated.
  • The trace shortening is linked to reduced self-verification and uncertainty-management behaviors, such as fewer double-checking steps.
  • While the compression does not significantly hurt performance on simpler problems, it may degrade performance on harder, more complex reasoning tasks.
  • The authors highlight the need for better robustness testing of reasoning behaviors and for improved context management in LLMs and LLM-based agents.
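The core measurements behind these findings are conceptually simple: compare reasoning-trace length for the same problem with and without surrounding context, and count self-verification behaviors in each trace. The sketch below illustrates that style of analysis; the function names, the keyword-based verification proxy, and the toy traces are illustrative assumptions, not the authors' actual evaluation harness (which calls real reasoning models).

```python
# Illustrative sketch (not the paper's code): measuring reasoning-trace
# compression and self-verification behavior across context conditions.
# In the real setup, traces would come from a reasoning model queried
# with the problem in isolation vs. embedded in long/multi-turn context.

def trace_length(trace: str) -> int:
    """Token-count proxy: number of whitespace-delimited tokens."""
    return len(trace.split())

def compression_ratio(isolated_trace: str, in_context_trace: str) -> float:
    """Fraction by which the in-context trace is shorter than the
    isolated-problem trace (0.5 means 50% shorter)."""
    base = trace_length(isolated_trace)
    if base == 0:
        return 0.0
    return 1.0 - trace_length(in_context_trace) / base

def count_verification_markers(trace: str) -> int:
    """Crude proxy for self-verification: occurrences of
    double-checking phrases in the trace (assumed marker list)."""
    markers = ("let me verify", "double-check", "wait,")
    lower = trace.lower()
    return sum(lower.count(m) for m in markers)

# Toy traces standing in for model output under the two conditions.
isolated = ("First compute 12 * 7 = 84. Wait, let me verify: "
            "12 * 7 = 84. Double-check: 84 / 7 = 12. Answer: 84.")
in_context = "12 * 7 = 84. Answer: 84."

print(f"compression: {compression_ratio(isolated, in_context):.0%}")
print("verification steps (isolated):", count_verification_markers(isolated))
print("verification steps (in-context):", count_verification_markers(in_context))
```

In this toy example the in-context trace is both shorter and free of double-checking phrases, mirroring the paper's observed association between trace compression and reduced self-verification.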

Abstract

Large language models (LLMs) exhibiting test-time scaling behavior, such as extended reasoning traces and self-verification, have demonstrated remarkable performance on complex, long-term reasoning tasks. However, the robustness of these reasoning behaviors remains underexplored. To investigate this, we conduct a systematic evaluation of multiple reasoning models across three scenarios: (1) problems augmented with lengthy, irrelevant context; (2) multi-turn conversational settings with independent tasks; and (3) problems presented as a subtask within a complex task. We observe an interesting phenomenon: reasoning models tend to produce much shorter reasoning traces (up to 50%) for the same problem under different context conditions compared to the traces produced when the problem is presented in isolation. A finer-grained analysis reveals that this compression is associated with a decrease in self-verification and uncertainty management behaviors, such as double-checking. While this behavioral shift does not compromise performance on straightforward problems, it might affect performance on more challenging tasks. We hope our findings draw additional attention to both the robustness of reasoning models and the problem of context management for LLMs and LLM-based agents.