TurnWise: The Gap between Single- and Multi-turn Language Model Capabilities

arXiv cs.CL / 3/18/2026

Key Points

The authors identify a gap between single-turn and multi-turn language model capabilities and propose TurnWiseEval to measure multi-turn performance in a way directly comparable to single-turn chat benchmarks.
They introduce TurnWiseData, a synthetic data pipeline that enables scalable generation of multi-turn training data.
Experiments with Olmo 3 show that incorporating multi-turn data during post-training is vital for strong multi-turn chat performance, with as little as 10k multi-turn conversations yielding about a 12% improvement on TurnWiseEval.
The work emphasizes the importance of multi-turn-focused data and evaluation to close the gap and improve model behavior in longer, more interactive conversations.

Abstract

Multi-turn conversations are a common and critical mode of language model interaction. However, current open training and evaluation data focus on single-turn settings, failing to capture the additional dimension of these longer interactions. To understand this multi-/single-turn gap, we first introduce a new benchmark, TurnWiseEval, for multi-turn capabilities that is directly comparable to single-turn chat evaluation. Our evaluation isolates multi-turn specific conversational ability through pairwise comparison to equivalent single-turn settings. We additionally introduce our synthetic multi-turn data pipeline TurnWiseData which allows the scalable generation of multi-turn training data. Our experiments with Olmo 3 show that training with multi-turn data is vital to achieving strong multi-turn chat performance, and that including as little as 10k multi-turn conversations during post-training can lead to a 12% improvement on TurnWiseEval.
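The abstract does not spell out how the pairwise comparison is scored. As a purely illustrative sketch, assume a judge has already labeled each prompt with which response was better: the one produced in the multi-turn setting, the one from the equivalent single-turn setting, or a tie. The function name, the label strings, and the convention that a tie counts as half a win are all hypothetical choices made here, not details taken from the paper.

```python
from collections import Counter

# Hypothetical label set: which response the judge preferred for one prompt.
VALID_LABELS = {"multi", "single", "tie"}


def pairwise_win_rate(judgments):
    """Return the share of comparisons won by the multi-turn response.

    Ties count as half a win (an assumption of this sketch, not a rule
    stated in the source).
    """
    if not judgments:
        raise ValueError("need at least one judgment")
    counts = Counter(judgments)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return (counts["multi"] + 0.5 * counts["tie"]) / len(judgments)


if __name__ == "__main__":
    # Made-up labels for four prompts, for illustration only.
    example = ["multi", "single", "single", "tie"]
    print(pairwise_win_rate(example))  # (1 + 0.5) / 4 = 0.375
```

Under this convention, a score of 0.5 would mean the model does as well in the multi-turn setting as in the matched single-turn one, and a score below 0.5 would indicate a multi-turn gap.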
