A Time-Consistent Benchmark for Repository-Level Software Engineering Evaluation
arXiv cs.AI · March 30, 2026
Key Points
- The paper proposes a time-consistent benchmark methodology for repository-aware software engineering evaluation by snapshotting a repository at time T0 and restricting knowledge to artifacts available before T0.
- It derives natural-language engineering tasks from pull requests merged in the interval (T0, T1] and evaluates a single software engineering agent in matched A/B settings, with and without repository-derived code knowledge, holding all other factors constant.
- An LLM-assisted prompt-generation pipeline is used to transform historical pull requests into tasks, addressing issues like synthetic task design, prompt leakage, and temporal contamination.
- In baseline experiments on the DragonFly and React repositories using Claude-family models and multiple prompt granularities, file-level F1 increases monotonically with better prompt guidance, reaching around 0.808 for the strongest tested setup.
- The authors conclude that prompt construction is a primary benchmark variable and emphasize that temporal consistency and strong prompt control are essential for valid evaluation of repository-aware systems.
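The file-level F1 metric mentioned above compares the set of files an agent actually modified against the set of files changed in the reference pull request. A minimal sketch of that computation (the function name and signature are illustrative, not from the paper):

```python
def file_level_f1(predicted_files, gold_files):
    """Precision/recall/F1 over sets of file paths: the files an agent
    touched versus the files changed in the reference PR.

    Returns (precision, recall, f1). Empty sets yield zero scores
    rather than raising, so degenerate tasks do not crash aggregation.
    """
    predicted, gold = set(predicted_files), set(gold_files)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, if the agent edits `{a.py, b.py}` while the reference PR changed `{a.py, c.py}`, precision and recall are both 0.5, giving F1 = 0.5. Scores like the paper's reported ~0.808 would be averages of this per-task metric across the benchmark.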