Talking to a Know-It-All GPT or a Second-Guesser Claude? How Repair Reveals Unreliable Multi-Turn Behavior in LLMs

arXiv cs.CL / 4/22/2026

💬 Opinion · Models & Research

Key Points

  • The study explores how LLMs perform “repair” during multi-turn dialogues in math question settings, comparing model-initiated versus user-initiated repairs.
  • Results show large behavioral differences across LLMs, ranging from models that are largely resistant to appropriate repair to models that are overly susceptible and easily manipulated.
  • As dialogues extend beyond a single turn, model behavior becomes more distinctive and less predictable across systems.
  • The paper concludes that each tested LLM has a characteristic kind of unreliability specifically related to conversational repair.

Abstract

Repair, an important resource for resolving trouble in human-human conversation, remains underexplored in human-LLM interaction. In this study, we investigate how LLMs engage in the interactive process of repair in multi-turn dialogues around solvable and unsolvable math questions. We examine whether models initiate repair themselves and how they respond to user-initiated repair. Our results show strong differences across models: reactions range from being almost completely resistant to (appropriate) repair attempts to being highly susceptible and easily manipulated. We further demonstrate that once conversations extend beyond a single turn, model behavior becomes more distinctive and less predictable across systems. Overall, our findings indicate that each tested LLM exhibits its own characteristic form of unreliability in the context of repair.