SWE-PRBench: Benchmarking AI Code Review Quality Against Pull Request Feedback

arXiv cs.AI / 3/30/2026


Key Points

  • SWE-PRBench is introduced: a benchmark of 350 human-annotated pull requests with ground-truth issue labels, built to measure how well AI models catch the issues raised in real PR review feedback.
  • An LLM-as-judge evaluation framework is validated against human annotators (kappa=0.75), yet eight frontier models detect only 15–31% of human-flagged issues in the diff-only configuration, trailing human expert performance despite strong results on code-generation benchmarks.
  • The study systematically varies available context across three frozen configurations (config_A: diff only; config_B: diff + file content; config_C: full context) and finds that all models degrade monotonically from config_A to config_C, even when the richer configurations supply structured semantic context such as AST-derived function context and import-graph resolution.
  • A key failure mechanism is identified: detection of “Type2_Contextual” issues collapses at config_B, consistent with attention dilution in longer prompts/contexts.
  • A structured ~2,000-token “diff-with-summary” prompt outperforms a longer ~2,500-token full-context prompt enriched with execution context, behaviour mapping, and test signatures; the dataset, contexts, annotations, and evaluation harness are released publicly.
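
The three frozen configurations above amount to a prompt-assembly choice: each step appends more material to the same review input. A minimal sketch of that assembly, assuming hypothetical field names (`diff`, `file_contents`, `full_context`) for a PR record — the paper's actual harness schema is not shown here:

```python
def build_review_prompt(pr: dict, config: str) -> str:
    """Assemble the model's review input under one of the three
    frozen configurations. Field names are hypothetical placeholders,
    not the benchmark's actual schema."""
    parts = [pr["diff"]]                    # config_A: diff only
    if config in ("config_B", "config_C"):
        parts.append(pr["file_contents"])   # config_B: + changed-file content
    if config == "config_C":
        parts.append(pr["full_context"])    # config_C: + AST/import-graph context
    return "\n\n".join(parts)
```

The point the paper makes is that each added layer grows the prompt, and beyond config_A the extra tokens dilute attention rather than help detection.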

Abstract

We introduce SWE-PRBench, a benchmark of 350 pull requests with human-annotated ground truth for evaluating AI code review quality. Evaluated against an LLM-as-judge framework validated at kappa=0.75, 8 frontier models detect only 15-31% of human-flagged issues on the diff-only configuration, demonstrating that AI code review remains far below human expert performance despite strong results on code generation benchmarks. Pull requests are drawn from active open-source repositories, filtered from 700 candidates using a Repository Quality Score, and evaluated under three frozen context configurations: diff only (config_A), diff with file content (config_B), and full context (config_C), enabling systematic ablation of context provision strategies. All 8 models degrade monotonically from config_A to config_C, even when context is provided via structured semantic layers including AST-extracted function context and import graph resolution. The dominant mechanism is a collapse of Type2_Contextual issue detection at config_B, consistent with attention dilution in long contexts: a structured 2,000-token diff-with-summary prompt outperforms a 2,500-token full-context prompt enriched with execution context, behaviour mapping, and test signatures across all 8 models. The top four models are statistically indistinguishable (mean score 0.147-0.153) while a clear tier gap separates them from the remaining four (mean score <= 0.113). Dataset, contexts, annotations, and evaluation harness are released publicly.
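The kappa=0.75 figure above is Cohen's kappa: agreement between the LLM judge and human annotators, corrected for agreement expected by chance. A self-contained sketch of the statistic (the paper's own validation code is not shown; this is just the standard formula):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa between two raters over the same items.
    kappa = (p_observed - p_expected) / (1 - p_expected)."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items both raters label identically.
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_exp = sum(ca[k] * cb[k] for k in ca.keys() | cb.keys()) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)
```

By the usual Landis–Koch reading, 0.75 falls in the "substantial agreement" band, which is why the authors treat the judge as a usable proxy for human scoring.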