AI Navigate

More Rounds, More Noise: Why Multi-Turn Review Fails to Improve Cross-Context Verification

arXiv cs.CL / March 18, 2026


Key Points

  • The study compares multi-turn Dynamic Cross-Context Review (D-CCR) variants against a single-pass CCR baseline in a controlled experiment with 30 artifacts and 150 injected errors, finding that single-pass CCR significantly outperforms all multi-turn variants (F1 = 0.376 vs. 0.303 for the best multi-turn variant, D-CCR-2b; p < 0.001, d = -0.59).
  • Multi-turn review increases recall by about 0.08 but generates 62% more false positives (8.5 vs. 5.2), reducing precision from 0.30 to 0.20, indicating a net degradation in verification quality.
  • The degradation is driven by false positive pressure in later rounds and Review Target Drift, where reviewers shift from evaluating the artifact to critiquing the conversation itself.
  • Independent re-review without prior context (D-CCR-2c) performs worst (F1 = 0.263), suggesting that mere repetition adds noise rather than improvement; within multi-turn settings, more information helps (D-CCR-2b outperforms D-CCR-2a) but does not overcome the noise introduced by reviewing again.
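The precision/recall trade-off in the second bullet can be checked directly with the F1 formula (the harmonic mean of precision and recall). A minimal sketch: the recall values below are approximate, back-solved from the reported F1 and precision figures, and are illustrative rather than taken from the paper.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Single-pass CCR: reported precision ~0.30; a recall of ~0.50 is
# back-calculated here so that F1 lands near the reported 0.376.
baseline = f1(0.30, 0.50)    # = 0.375

# Multi-turn D-CCR-2b: precision collapses to ~0.20 while recall
# rises by the reported ~0.08, to ~0.58.
multi_turn = f1(0.20, 0.58)  # ~0.297

# The recall gain cannot offset the precision drop.
assert multi_turn < baseline
```

Because F1 is a harmonic mean, it is dominated by the smaller of the two components; a precision drop from 0.30 to 0.20 therefore outweighs a recall gain of 0.08, matching the net degradation the study reports.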

Abstract

Cross-Context Review (CCR) improves LLM verification by separating production and review into independent sessions. A natural extension is multi-turn review: letting the reviewer ask follow-up questions, receive author responses, and review again. We call this Dynamic Cross-Context Review (D-CCR). In a controlled experiment with 30 artifacts and 150 injected errors, we tested four D-CCR variants against the single-pass CCR baseline. Single-pass CCR (F1 = 0.376) significantly outperformed all multi-turn variants, including D-CCR-2b with question-and-answer exchange (F1 = 0.303, p < 0.001, d = -0.59). Multi-turn review increased recall (+0.08) but generated 62% more false positives (8.5 vs. 5.2), collapsing precision from 0.30 to 0.20. Two mechanisms drive this degradation: (1) false positive pressure -- reviewers in later rounds fabricate findings when the artifact's real errors have been exhausted, and (2) Review Target Drift -- reviewers provided with prior Q&A exchanges shift from reviewing the artifact to critiquing the conversation itself. Independent re-review without prior context (D-CCR-2c) performed worst (F1 = 0.263), confirming that mere repetition degrades rather than helps. The degradation stems from false positive pressure in additional rounds, not from information amount -- within multi-turn conditions, more information actually helps (D-CCR-2b > D-CCR-2a). The problem is not what the reviewer sees, but that reviewing again invites noise.