XpertBench: Expert-Level Tasks with Rubrics-Based Evaluation
arXiv cs.AI / 4/6/2026
Key Points
- The paper introduces XpertBench, a rubrics-based benchmark with 1,346 expert-level tasks across 80 categories intended to better evaluate LLM performance on complex open-ended professional work.
- XpertBench draws tasks from 1,000+ expert submissions across domains such as finance, healthcare, legal services, education, and dual-track research, aiming for higher ecological validity than conventional benchmarks.
- Each task is scored against a detailed rubric, most with 15–40 weighted checkpoints, to measure professional rigor and reduce ambiguity in evaluation (see the first sketch after this list).
- The authors propose ShotJudge, an evaluation paradigm that calibrates LLM judges with expert few-shot exemplars to mitigate self-rewarding evaluation biases (the second sketch below illustrates the idea).
- Experiments show current leading LLMs face an “expert-gap,” with a reported peak success rate of ~66% and mean scores around 55%, along with noticeable domain-specific strengths and weaknesses.
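To make the weighted-checkpoint idea concrete, here is a minimal sketch of how such a rubric score could be aggregated. The exact rubric format, weights, and aggregation rule used by XpertBench are not described in this summary, so the class names and example rubric below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    """One rubric item: a description, its weight, and whether the response satisfied it."""
    description: str
    weight: float
    satisfied: bool

def rubric_score(checkpoints: list[Checkpoint]) -> float:
    """Weighted fraction of satisfied checkpoints, normalized to [0, 1]."""
    total = sum(c.weight for c in checkpoints)
    if total == 0:
        return 0.0
    earned = sum(c.weight for c in checkpoints if c.satisfied)
    return earned / total

# Hypothetical three-item rubric for an expert task.
rubric = [
    Checkpoint("States the correct discount rate", 3.0, True),
    Checkpoint("Shows the cash-flow derivation", 2.0, True),
    Checkpoint("Flags the key regulatory risk", 5.0, False),
]
print(f"score = {rubric_score(rubric):.2f}")  # -> score = 0.50
```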
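The calibration step behind ShotJudge can likewise be sketched as prompt construction: expert-graded exemplars are shown to the judge before the candidate response so that its scoring scale is anchored to expert grading rather than to the judge model's own preferences. This is only an illustration of that idea, not the paper's actual implementation; `call_llm` and the exemplar format are placeholders.

```python
from dataclasses import dataclass

@dataclass
class GradedExemplar:
    """An expert-graded example: a prior response to the same task plus the expert's score."""
    response: str
    expert_score: float
    expert_rationale: str

def build_judge_prompt(task: str, rubric_text: str,
                       exemplars: list[GradedExemplar],
                       candidate: str) -> str:
    """Assemble a judging prompt that presents expert-scored exemplars before the candidate."""
    parts = [
        "You are grading a response against the rubric below.",
        f"Task:\n{task}",
        f"Rubric:\n{rubric_text}",
        "Expert-graded examples for calibration:",
    ]
    for i, ex in enumerate(exemplars, 1):
        parts.append(
            f"Example {i}:\nResponse: {ex.response}\n"
            f"Expert score: {ex.expert_score}\nRationale: {ex.expert_rationale}"
        )
    parts.append(f"Now grade this response on the same scale:\n{candidate}")
    return "\n\n".join(parts)

# judge_output = call_llm(build_judge_prompt(task, rubric_text, exemplars, candidate))
# call_llm is a placeholder for whichever model API serves as the judge.
```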