Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis
arXiv cs.CL / 3/24/2026
Key Points
- The paper identifies a credibility crisis in LLM coding benchmarks: existing contamination detection methods cannot directly distinguish genuine reasoning from recall, and repeated verification can inflate false positives.
- It proposes Cross-Context Verification (CCV), a black-box approach that runs the same benchmark problem in N independent, session-isolated contexts and uses the diversity of the resulting solutions to distinguish contamination (memorized answers) from genuine reasoning.
- On 9 SWE-bench Verified problems (45 trials) with Claude Opus 4.6 at temperature 0, CCV reports perfect separation between contaminated and genuine reasoning, with the absence of visible reasoning emerging as a strong discriminator.
- The study finds that contamination labels previously used in the benchmark pipeline contain a substantial false-positive rate (33%) and introduces the Hierarchical Cross-Context Architecture (HCCA), which reduces confirmation bias through intentionally information-restricted, specialized multi-agent analysis.
- A follow-on multi-stage verification pilot (Worker→Verifier→Director) failed due to "sycophantic confirmation," reinforcing that restricting information matters more than adding structural complexity; the authors release code and data.
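The core CCV idea in the bullets above, low solution diversity across session-isolated runs as a contamination signal, can be sketched in a few lines. The diversity metric (pairwise distinctness) and the threshold below are hypothetical stand-ins; the summary does not specify the paper's actual measure:

```python
from itertools import combinations

def diversity_score(solutions):
    """Fraction of pairwise-distinct solutions among N isolated runs.

    Illustrative metric only: normalized string inequality stands in
    for whatever diversity measure the paper actually uses.
    """
    pairs = list(combinations(solutions, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)

def flag_contamination(solutions, threshold=0.2):
    """Low cross-context diversity suggests recall of a memorized answer.

    The 0.2 threshold is an arbitrary example value, not from the paper.
    """
    return diversity_score(solutions) < threshold

# Five session-isolated runs returning identical patches -> suspicious
identical = ["patch_v1"] * 5
varied = ["patch_a", "patch_b", "patch_a", "patch_c", "patch_d"]
print(flag_contamination(identical))  # True: zero diversity
print(flag_contamination(varied))     # False: high diversity
```

In practice a real implementation would compare normalized patches or test-passing behavior rather than raw strings, but the decision rule, flag problems whose N isolated solutions are near-identical, is the same.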