ProdCodeBench: A Production-Derived Benchmark for Evaluating AI Coding Agents

Key Points

The authors conclude that iterative code verification helps agents behave effectively, and that exposing codebase-specific verification mechanisms may significantly improve agents' performance in unfamiliar environments. They publish the methodology so that other organizations can replicate it.

Abstract

Benchmarks that reflect production workloads are better for evaluating AI coding agents in industrial settings, yet existing benchmarks differ from real usage in programming language distribution, prompt style, and codebase structure. This paper presents a methodology for curating production-derived benchmarks, illustrated through ProdCodeBench, a benchmark built from real sessions with a production AI coding assistant. We detail our data collection and curation practices, including LLM-based task classification, test relevance validation, and multi-run stability checks, which address challenges in constructing reliable evaluation signals from monorepo environments. Each curated sample consists of a verbatim prompt, a committed code change, and fail-to-pass tests, with samples spanning seven programming languages. Our systematic analysis of four foundation models yields solve rates from 53.2% to 72.2% and reveals that models making greater use of work validation tools, such as executing tests and invoking static analysis, achieve higher solve rates. This suggests that iterative verification helps achieve effective agent behavior and that exposing codebase-specific verification mechanisms may significantly improve the performance of externally trained agents operating in unfamiliar environments. We share our methodology and lessons learned to enable other organizations to construct similar production-derived benchmarks.
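To make the terms "fail-to-pass test" and "multi-run stability check" concrete, here is a minimal sketch in Python. It is not the authors' implementation: the names `RunHistory`, `is_stable_fail_to_pass`, and `solve_rate` are hypothetical, and the rule that every repeated run must agree is an assumption about how a stability check might be defined.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RunHistory:
    """Hypothetical record of one test's outcomes over repeated runs (True = pass)."""
    before_change: Sequence[bool]  # runs without the committed code change
    after_change: Sequence[bool]   # runs with the committed code change applied


def is_stable_fail_to_pass(runs: RunHistory) -> bool:
    """Accept a test only if it fails on every run before the change and
    passes on every run after it (assumed rule; flaky tests are rejected)."""
    if not runs.before_change or not runs.after_change:
        return False  # no evidence on one side, so the test cannot be validated
    return not any(runs.before_change) and all(runs.after_change)


def solve_rate(solved: Sequence[bool]) -> float:
    """Percentage of benchmark tasks that an agent solved."""
    if not solved:
        return 0.0
    return 100.0 * sum(solved) / len(solved)
```

Under this assumed rule, a test that passes even once before the change, or fails even once after it, is treated as unstable and excluded from the evaluation signal.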
