Reasoning Gets Harder for LLMs Inside A Dialogue

arXiv cs.CL / 3/23/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper introduces BOULDER, a dynamic benchmark with eight travel-related tasks requiring arithmetic, spatial, and temporal reasoning, and presents both isolated and dialogue-based variants for controlled comparison.
  • It reports a substantial and consistent performance gap between isolated and dialogue-based reasoning across eight LLMs, highlighting challenges in reasoning under real-world dialogue conditions.
  • The gap is largely attributed to the multi-turn nature of dialogue, with additional effects from role conditioning and tool-use requirements in task-oriented dialogue.
  • The authors argue that evaluating LLM reasoning in realistic interactive scenarios is necessary to accurately assess practical capabilities and limitations.

Abstract

Large Language Models (LLMs) achieve strong performance on many reasoning benchmarks, yet these evaluations typically focus on isolated tasks that differ from real-world usage in task-oriented dialogue (TOD). In that setting, LLMs must reason while simultaneously generating text and adhering to instructions on role, format, and style. This mismatch raises concerns about whether benchmark performance accurately reflects models' reasoning robustness in the TOD setting. We investigate how framing reasoning tasks within TOD affects LLM performance by introducing BOULDER, a new dynamic benchmark covering eight travel-related tasks that require arithmetic, spatial, and temporal reasoning with both commonsense and formal aspects. Each problem is presented in both an isolated and a dialogue-based variant, enabling controlled comparison while mitigating data contamination. Experiments on eight LLMs reveal a substantial and consistent performance gap between the isolated and dialogue settings. Through ablations and qualitative analysis, we show that this gap is largely driven by the multi-turn nature of dialogue, with additional effects from role conditioning and tool-use requirements. Our results highlight the need to evaluate LLM reasoning in realistic interactive scenarios.
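To make the paired-variant idea concrete, here is a minimal sketch of how one reasoning problem could be rendered both as a standalone prompt and embedded in a multi-turn travel dialogue. All function names, templates, and the example dialogue are illustrative assumptions, not the paper's actual BOULDER implementation.

```python
# Illustrative sketch (not the paper's code): render the same reasoning
# problem in an isolated variant and a dialogue-based variant.

def make_isolated(question: str) -> str:
    """Render the problem as a standalone benchmark prompt."""
    return f"Question: {question}\nAnswer:"

def make_dialogue(question: str, turns: list[tuple[str, str]]) -> str:
    """Embed the same problem in a multi-turn travel-agent dialogue,
    so the model must reason while staying in role and tracking context."""
    lines = ["System: You are a helpful travel agent. Stay in role."]
    for speaker, utterance in turns:
        lines.append(f"{speaker}: {utterance}")
    lines.append(f"User: {question}")
    lines.append("Agent:")
    return "\n".join(lines)

# Example: a temporal-reasoning problem with hypothetical dialogue context.
question = ("My flight departs at 14:35 and the layover is 2 h 50 min. "
            "When does the connecting flight leave?")
context = [
    ("User", "Hi, I'm planning a trip from Boston to Lisbon."),
    ("Agent", "Happy to help! Do you have travel dates in mind?"),
]

print(make_isolated(question))
print(make_dialogue(question, context))
```

Scoring both prompts with the same gold answer is what enables the paper's controlled comparison: any accuracy drop on the second variant is attributable to the dialogue framing rather than the underlying problem.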