Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement
arXiv cs.CL / 4/27/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper addresses how to train automatic judges for LLM-generated business ideas when human experts disagree, asking whether judges should mimic an aggregate consensus or individual evaluator judgments.
- It introduces PBIG-DATA, a dataset of about 3,000 expert score entries across 300 patent-grounded product ideas, covering six business evaluation dimensions with structured (not purely random) evaluator disagreement.
- Experiments compare three judge setups—rubric-only zero-shot, an aggregate judge using mixed evaluator histories, and a personalized judge using a specific evaluator’s scoring history—and find that personalized judges match the targeted evaluator more closely.
- The study also finds that evaluator agreement correlates with the similarity of judge-generated reasoning only under personalized conditioning, suggesting that pooled consensus labels may be fragile in pluralistic assessment settings.
- Overall, the results motivate evaluator-conditioned (personalized) judging approaches for business idea assessment instead of relying on pooled consensus labels.
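The three judge setups compared in the paper can be illustrated with a minimal prompt-assembly sketch. This is an assumption-laden illustration, not the paper's actual code: the function name, prompt wording, and score scale are all hypothetical; the only grounded idea is that the aggregate judge conditions on mixed evaluator histories while the personalized judge conditions on a single evaluator's history.

```python
# Hypothetical sketch of the three judge setups: rubric-only zero-shot,
# aggregate (mixed evaluator histories), and personalized (one evaluator's
# history). Prompt wording and names are illustrative assumptions.

def build_judge_prompt(idea, rubric, history=None):
    """Assemble a judging prompt.

    history: optional list of (past_idea, score) pairs. The aggregate judge
    would pass pooled examples from many evaluators; the personalized judge
    would pass examples drawn from a single target evaluator.
    """
    parts = ["Rubric:\n" + rubric]
    if history:
        examples = "\n".join(
            f"- Idea: {ex} -> Score: {score}" for ex, score in history
        )
        parts.append("Past scoring examples:\n" + examples)
    parts.append("Now score this idea (1-5):\n" + idea)
    return "\n\n".join(parts)


# Rubric-only zero-shot: no conditioning on any evaluator history.
zero_shot = build_judge_prompt(
    "AI-assisted patent search tool", "Rate market potential from 1 to 5."
)

# Personalized: conditioned on one (hypothetical) evaluator's past scores.
personalized = build_judge_prompt(
    "AI-assisted patent search tool",
    "Rate market potential from 1 to 5.",
    history=[("Drone delivery service", 4), ("Smart mirror", 2)],
)
```

In this framing, the only difference between the aggregate and personalized conditions is whose scoring examples populate `history`, which is what makes the comparison in the paper a clean test of conditioning rather than of prompt format.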