A Time-Consistent Benchmark for Repository-Level Software Engineering Evaluation

arXiv cs.AI · March 30, 2026


Key Points

  • The paper proposes a time-consistent benchmark methodology for repository-aware software engineering evaluation by snapshotting a repository at time T0 and restricting knowledge to artifacts available before T0.
  • It derives natural-language engineering tasks from pull requests merged in the future interval (T0, T1] and evaluates a single software engineering agent in matched A/B settings, with and without repository-derived code knowledge, while holding all other factors constant.
  • An LLM-assisted prompt-generation pipeline is used to transform historical pull requests into tasks, addressing issues like synthetic task design, prompt leakage, and temporal contamination.
  • In baseline experiments on the DragonFly and React repositories, using Claude-family models and four prompt granularities, file-level F1 increases monotonically from minimal to guided prompts, reaching roughly 0.808 for the strongest tested model on each repository.
  • The authors conclude that prompt construction is a primary benchmark variable and emphasize that temporal consistency and strong prompt control are essential for valid evaluation of repository-aware systems.
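The file-level F1 reported above compares the set of files an agent touches against the files changed in the ground-truth pull request. The paper does not publish its scoring code, so the sketch below is a plausible reconstruction of such a set-based metric; the function name and inputs are hypothetical.

```python
def file_level_f1(predicted_files, gold_files):
    """Hypothetical sketch of a file-level F1 score: compare the files an
    agent edits against the files changed in the reference pull request."""
    pred, gold = set(predicted_files), set(gold_files)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)          # files correctly identified
    precision = tp / len(pred)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

On this definition, an agent that edits `{a.py, b.py}` when the reference PR changed `{a.py, c.py}` scores precision 0.5, recall 0.5, and F1 0.5.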

Abstract

Evaluation of repository-aware software engineering systems is often confounded by synthetic task design, prompt leakage, and temporal contamination between repository knowledge and future code changes. We present a time-consistent benchmark methodology that snapshots a repository at time T0, constructs repository-derived code knowledge using only artifacts available before T0, and evaluates on engineering tasks derived from pull requests merged in the future interval (T0, T1]. Each historical pull request is transformed into a natural-language task through an LLM-assisted prompt-generation pipeline, and the benchmark is formalized as a matched A/B comparison in which the same software engineering agent is evaluated with and without repository-derived code knowledge while all other variables are held constant. We also report a baseline characterization study on two open-source repositories, DragonFly and React, using three Claude-family models and four prompt granularities. Across both repositories, file-level F1 increases monotonically from minimal to guided prompts, reaching 0.8081 on DragonFly and 0.8078 on React for the strongest tested model. These results show that prompt construction is a first-order benchmark variable. More broadly, the benchmark highlights that temporal consistency and prompt control are core validity requirements for repository-aware software engineering evaluation.
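The core temporal-consistency rule in the abstract is simple to state: repository knowledge may use only artifacts from before the snapshot time T0, while evaluation tasks come from pull requests merged in the half-open interval (T0, T1]. A minimal sketch of that selection rule, assuming a hypothetical list of PR records with a `merged_at` timestamp:

```python
from datetime import datetime

def select_eval_prs(pull_requests, t0, t1):
    """Keep only pull requests merged strictly after the snapshot time T0
    and no later than T1, i.e. the (T0, T1] evaluation window described
    in the paper. `pull_requests` is a hypothetical list of dicts with a
    `merged_at` datetime; anything at or before T0 is repository knowledge,
    never an evaluation task."""
    return [pr for pr in pull_requests if t0 < pr["merged_at"] <= t1]
```

Using a half-open interval matters: a PR merged exactly at T0 is part of the snapshot (and thus potential repository knowledge), so including it as a task would leak the answer into the knowledge base.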