ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation

arXiv cs.CL / April 30, 2026


Key Points

  • The paper introduces ClassEval-Pro, a new cross-domain benchmark focused on class-level (compositional) code generation that bridges the gap between function-level synthesis and full repository edits.
  • ClassEval-Pro comprises 300 tasks across 11 domains, built with an automated three-stage pipeline and includes real GitHub code added after January 2025 to reduce contamination and improve realism.
  • Each task is validated using an LLM Judge Ensemble and requires passing test suites with over 90% line coverage, aiming for robust evaluation quality.
  • Five frontier LLMs are evaluated under five generation strategies; the best model reaches a class-level Pass@1 of only 45.6% (the Pass@k metric is sketched after this list), with a 17.7-point gap between the strongest and weakest models.
  • Generation strategy matters: structured bottom-up generation lifts weaker models by up to 9.4 points, whereas compositional generation collapses to as low as 1.3%. The dominant failure causes are logic errors (56.2%) and dependency errors (38.0%), pointing to cross-method coordination as the core bottleneck.
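
The paper reports class-level Pass@1 but does not spell out the estimator. Pass@k is conventionally computed with the unbiased estimator of Chen et al. (2021); the minimal sketch below shows that form, with Pass@1 reducing to the fraction of single generations that pass the full test suite. The function name and toy numbers are illustrative, not taken from the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per task
    c: samples whose class passes all tests
    k: sampling budget being estimated
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k = 1 this reduces to c / n.
print(pass_at_k(n=10, c=3, k=1))  # 0.3
```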

Abstract

LLMs have achieved strong results on both function-level code synthesis and repository-level code modification, yet a capability that falls between these two extremes -- compositional code creation, i.e., building a complete, internally structured class from a specification -- remains underserved. Current evaluations are either confined to isolated functions or rely on manually curated class-level tasks that are expensive to scale and increasingly susceptible to data contamination. We introduce ClassEval-Pro, a benchmark of 300 class-level tasks spanning 11 domains, constructed through an automated three-stage pipeline that combines complexity enhancement, cross-domain class composition, and integration of real-world GitHub code contributed after January 2025. Every task is validated by an LLM Judge Ensemble and must pass test suites with over 90% line coverage. We evaluate five frontier LLMs under five generation strategies. The best model achieves only 45.6% class-level Pass@1, with a 17.7-point gap between the strongest and weakest models, confirming the benchmark's discriminative power. Strategy choice strongly interacts with model capability: structured approaches such as bottom-up improve weaker models by up to 9.4 percentage points, while compositional generation collapses to as low as 1.3%. Error analysis over 500 manually annotated failures reveals that logic errors (56.2%) and dependency errors (38.0%) dominate, identifying cross-method coordination as the core bottleneck.
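
The benchmark's validation gate requires each task's test suite to exceed 90% line coverage. The paper does not name its tooling, so the sketch below is only one plausible way to enforce such a gate for Python tasks, assuming pytest with the pytest-cov plugin; the module and directory names are placeholders.

```python
import subprocess
import sys

def meets_coverage_gate(package: str, tests_dir: str, threshold: int = 90) -> bool:
    """Run a task's test suite and accept it only if line coverage
    of the candidate implementation reaches the threshold.

    `package` and `tests_dir` are placeholders for the task's
    solution module and its test directory (not names from the paper).
    """
    result = subprocess.run(
        [
            sys.executable, "-m", "pytest", tests_dir,
            f"--cov={package}",               # measure only the solution code
            f"--cov-fail-under={threshold}",  # non-zero exit if coverage is below threshold
            "-q",
        ],
        capture_output=True,
        text=True,
    )
    # Exit code 0 means every test passed AND coverage met the threshold.
    return result.returncode == 0

if __name__ == "__main__":
    print(meets_coverage_gate("solution", "tests"))
```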