Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis

arXiv cs.CL / 3/24/2026


Key Points

  • The paper identifies a credibility crisis in LLM coding benchmarks, noting that existing contamination detection methods cannot directly distinguish reasoning from recall and that repeated verification can increase false positives.
  • It proposes Cross-Context Verification (CCV), a black-box approach that solves the same benchmark problem in N independent, session-isolated contexts and uses solution diversity to distinguish contaminated recall from genuine reasoning.
  • Using 9 SWE-bench Verified problems (45 trials) with Claude Opus 4.6 at temperature 0, CCV reports perfect separation between contaminated and genuine reasoning, with the absence of reasoning serving as a perfect discriminator.
  • The study finds that 33% of the contamination labels previously used in the benchmark pipeline were false positives, and introduces the Hierarchical Cross-Context Architecture (HCCA), which reduces confirmation bias via intentionally restricted, specialized multi-agent analysis.
  • A follow-on multi-stage verification pilot (Worker → Verifier → Director) failed due to "sycophantic confirmation," reinforcing that information restriction, not added structural complexity, is the key mechanism; the authors release code and data.
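The CCV idea described in the key points, solving the same problem in N isolated sessions and scoring how much the solutions differ, can be sketched in a few lines. This is an illustrative sketch only: the diversity metric here (mean pairwise string dissimilarity via `difflib`) and the canned session outputs are assumptions, not the paper's actual implementation.

```python
import difflib
from itertools import combinations

def cross_context_diversity(solutions):
    """Mean pairwise dissimilarity across N session-isolated solutions.

    A score near 0 suggests verbatim recall (contamination); higher
    values suggest independent reasoning. The paper's exact metric is
    not specified here -- this uses difflib's similarity ratio as a
    stand-in.
    """
    sims = [difflib.SequenceMatcher(None, a, b).ratio()
            for a, b in combinations(solutions, 2)]
    return 1.0 - sum(sims) / len(sims)

# Hypothetical outputs from 3 isolated sessions (stand-ins for real
# model runs against the same benchmark problem):
contaminated = ["return x + 1"] * 3   # identical across sessions: recall
genuine = ["return x + 1", "x += 1; return x", "return 1 + x"]

print(cross_context_diversity(contaminated))          # 0.0
print(cross_context_diversity(genuine) > 0.0)         # True
```

Under this toy metric, contaminated runs collapse to zero diversity while genuinely reasoned solutions, which vary in surface form, score above zero.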

Abstract

LLM coding benchmarks face a credibility crisis: widespread solution leakage and test quality issues undermine SWE-bench Verified, while existing detection methods (paraphrase consistency, n-gram overlap, perplexity analysis) never directly observe whether a model reasons or recalls. Meanwhile, simply repeating verification degrades accuracy: multi-turn review generates false positives faster than it discovers true errors, suggesting that structural approaches are needed. We introduce Cross-Context Verification (CCV), a black-box method that solves the same benchmark problem in N independent sessions and measures solution diversity, combined with the Hierarchical Cross-Context Architecture (HCCA), a multi-agent analysis framework that prevents confirmation bias through intentional information restriction across specialized analytical roles. On 9 SWE-bench Verified problems (45 trials, Claude Opus 4.6, temperature 0), CCV achieves perfect separation between contaminated and genuine reasoning (Mann-Whitney U = 0, p ≈ 0.012, r = 1.0). Key findings: (1) contamination is binary (models either recall perfectly or not at all); (2) reasoning absence is a perfect discriminator; (3) 33% of prior contamination labels are false positives; (4) HCCA's independent analysis structure discovers contamination-flaw composite cases that single-analyst approaches miss. A pilot experiment extending HCCA to multi-stage verification (Worker → Verifier → Director) yields a negative result (100% sycophantic confirmation), providing further evidence that information restriction, not structural complexity, is the key mechanism. We release all code and data.