A Time-Consistent Benchmark for Repository-Level Software Engineering Evaluation

arXiv cs.AI · March 30, 2026


Key Points

  • The paper proposes a time-consistent benchmark methodology for repository-aware software engineering evaluation by snapshotting a repository at time T0 and restricting knowledge to artifacts available before T0.
  • It derives natural-language engineering tasks from pull requests merged in the future interval (T0, T1] and evaluates a single software engineering agent in matched A/B settings, with and without repository-derived code knowledge, while holding all other factors constant.
  • An LLM-assisted prompt-generation pipeline is used to transform historical pull requests into tasks, addressing issues like synthetic task design, prompt leakage, and temporal contamination.
  • In baseline experiments on the DragonFly and React repositories, using Claude-family models and four prompt granularities, file-level F1 increases monotonically from minimal to guided prompts, reaching roughly 0.808 for the strongest tested model on each repository.
  • The authors conclude that prompt construction is a primary benchmark variable and emphasize that temporal consistency and strong prompt control are essential for valid evaluation of repository-aware systems.
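The file-level F1 reported above compares the set of files an agent touches against the files changed in the ground-truth pull request. The paper does not publish its scoring code, so the sketch below is a plausible reconstruction of such a set-based metric; the function name and inputs are hypothetical.

```python
def file_level_f1(predicted_files, gold_files):
    """Hypothetical sketch of a file-level F1 score: compare the files an
    agent edits against the files changed in the reference pull request."""
    pred, gold = set(predicted_files), set(gold_files)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)          # files correctly identified
    precision = tp / len(pred)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

On this definition, an agent that edits `{a.py, b.py}` when the reference PR changed `{a.py, c.py}` scores precision 0.5, recall 0.5, and F1 0.5.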

Abstract

Evaluation of repository-aware software engineering systems is often confounded by synthetic task design, prompt leakage, and temporal contamination between repository knowledge and future code changes. We present a time-consistent benchmark methodology that snapshots a repository at time T0, constructs repository-derived code knowledge using only artifacts available before T0, and evaluates on engineering tasks derived from pull requests merged in the future interval (T0, T1]. Each historical pull request is transformed into a natural-language task through an LLM-assisted prompt-generation pipeline, and the benchmark is formalized as a matched A/B comparison in which the same software engineering agent is evaluated with and without repository-derived code knowledge while all other variables are held constant. We also report a baseline characterization study on two open-source repositories, DragonFly and React, using three Claude-family models and four prompt granularities. Across both repositories, file-level F1 increases monotonically from minimal to guided prompts, reaching 0.8081 on DragonFly and 0.8078 on React for the strongest tested model. These results show that prompt construction is a first-order benchmark variable. More broadly, the benchmark highlights that temporal consistency and prompt control are core validity requirements for repository-aware software engineering evaluation.
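The core temporal-consistency rule in the abstract is simple to state: repository knowledge may use only artifacts from before the snapshot time T0, while evaluation tasks come from pull requests merged in the half-open interval (T0, T1]. A minimal sketch of that selection rule, assuming a hypothetical list of PR records with a `merged_at` timestamp:

```python
from datetime import datetime

def select_eval_prs(pull_requests, t0, t1):
    """Keep only pull requests merged strictly after the snapshot time T0
    and no later than T1, i.e. the (T0, T1] evaluation window described
    in the paper. `pull_requests` is a hypothetical list of dicts with a
    `merged_at` datetime; anything at or before T0 is repository knowledge,
    never an evaluation task."""
    return [pr for pr in pull_requests if t0 < pr["merged_at"] <= t1]
```

Using a half-open interval matters: a PR merged exactly at T0 is part of the snapshot (and thus potential repository knowledge), so including it as a task would leak the answer into the knowledge base.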