ReCUBE: Evaluating Repository-Level Context Utilization in Code Generation
arXiv cs.AI · March 30, 2026
Key Points
- The paper introduces ReCUBE, a new benchmark that isolates and measures how well LLMs utilize repository-level context by having models reconstruct a masked file using only the rest of the repository plus dependency specs and documentation.
- It evaluates generated code using usage-aware tests that cover both internal logic and cross-file integration, aiming to better reflect real-world software behavior than existing coding benchmarks.
- Results across eight models and multiple settings indicate that repository-level context utilization is still difficult even for state-of-the-art systems, with GPT-5 reaching a 37.57% strict pass rate in the full-context setting.
- To improve agentic repository exploration, the authors propose the Caller-Centric Exploration (CCE) toolkit based on dependency graphs, which can guide agents to the most relevant caller files and improves strict pass rates by up to 7.56%.
- The ReCUBE benchmark, code, and evaluation framework are released as open source for the research community.
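The caller-centric idea described above can be illustrated with a small sketch. The paper's actual CCE toolkit is not reproduced here; this is a minimal, hypothetical version assuming a file-level dependency map (`deps`, file → files it imports). Inverting that map gives a caller index, and a breadth-first walk over the reversed edges surfaces direct callers of the masked file first, i.e. the files most likely to show how its API is used:

```python
from collections import defaultdict, deque

# Hypothetical file-level dependency map: file -> set of files it imports.
deps = {
    "app/main.py": {"lib/parser.py", "lib/utils.py"},
    "tests/test_parser.py": {"lib/parser.py"},
    "lib/parser.py": {"lib/utils.py"},
}

def caller_files(deps, masked_file, max_hops=2):
    """BFS over the reversed dependency graph: direct callers of the
    masked file come first, then callers-of-callers, up to max_hops."""
    callers = defaultdict(set)
    for src, targets in deps.items():
        for tgt in targets:
            callers[tgt].add(src)

    seen, order = {masked_file}, []
    frontier = deque([(masked_file, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for c in sorted(callers[node]):  # deterministic order
            if c not in seen:
                seen.add(c)
                order.append(c)
                frontier.append((c, hops + 1))
    return order

print(caller_files(deps, "lib/parser.py"))
# -> ['app/main.py', 'tests/test_parser.py']
```

An agent could feed the files at the front of this list into its context first, which matches the intuition that caller files constrain the masked file's expected interface more tightly than arbitrary repository files.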