ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation
arXiv cs.CL / April 30, 2026
Key Points
- The paper introduces ClassEval-Pro, a new cross-domain benchmark focused on class-level (compositional) code generation that bridges the gap between function-level synthesis and full repository edits.
- ClassEval-Pro comprises 300 tasks across 11 domains, built with an automated three-stage pipeline and includes real GitHub code added after January 2025 to reduce contamination and improve realism.
- Each task is validated by an LLM judge ensemble and must pass a test suite with over 90% line coverage, supporting robust evaluation quality.
- Five frontier LLMs are evaluated under five generation strategies; the best class-level Pass@1 is 45.6%, with a substantial 17.7-point gap between the strongest and weakest models.
- Generation strategy matters greatly: structured bottom-up methods lift weak models by up to 9.4 points, while fully compositional generation can fall to 1.3%. The dominant failure causes are logic errors (56.2%) and dependency errors (38.0%).
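The Pass@1 numbers above are presumably computed with the standard unbiased pass@k estimator common in code-generation benchmarks (the summary does not specify the paper's exact procedure, so this is an assumption). A minimal sketch in Python, with hypothetical per-task sample counts:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated for a task,
    c of which pass all tests. Returns the expected probability that
    at least one of k randomly drawn samples passes."""
    if n - c < k:
        return 1.0  # fewer than k failing samples: some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical benchmark run: (n, c) per task, 10 samples each.
per_task = [(10, 4), (10, 0), (10, 10)]
score = sum(pass_at_k(n, c, 1) for n, c in per_task) / len(per_task)
```

For k=1 the estimator reduces to the fraction of passing samples per task (c/n), averaged across tasks; the benchmark-level figure like 45.6% is that average expressed as a percentage.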
Related Articles
Vector DB and ANN vs PHE conflict, is there a practical workaround? [D]
Reddit r/MachineLearning

Agent Amnesia and the Case of Henry Molaison
Dev.to

Azure Weekly: Microsoft and OpenAI Restructure Partnership as GPT-5.5 Lands in Foundry
Dev.to

Proven Patterns for OpenAI Codex in 2026: Prompts, Validation, and Gateway Governance
Dev.to

Vibe coding is a tool, not a shortcut. Most people are using it wrong.
Dev.to