Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement

arXiv cs.CL · April 27, 2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper addresses how to train automatic judges for LLM-generated business ideas when human experts disagree, asking whether judges should mimic an aggregate consensus or individual evaluator judgments.
  • It introduces PBIG-DATA, a dataset of about 3,000 expert score entries across 300 patent-grounded product ideas, covering six business evaluation dimensions with structured (not purely random) evaluator disagreement.
  • Experiments compare three judge setups (a rubric-only zero-shot judge, an aggregate judge conditioned on mixed evaluator histories, and a personalized judge conditioned on the target evaluator's scoring history; see the sketch after this list) and find that personalized judges match the targeted evaluator more closely.
  • The study also shows that evaluator agreement relates to similarity of judge-generated reasoning only when using personalized conditioning, implying that pooled labels may be fragile in pluralistic assessment.
  • Overall, the results motivate evaluator-conditioned (personalized) judging approaches for business idea assessment instead of relying on pooled consensus labels.
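A minimal sketch of how the three conditioning setups might differ at the prompt level, assuming simple in-context score histories. Everything here is illustrative: `ScoreRecord`, `build_prompt`, the toy records, and the 1-5 rubric are stand-ins, not the paper's actual schema or prompt text.

```python
# Illustrative sketch of the three judge configurations; all names and
# toy data are assumptions, not the paper's actual schema or prompts.
from dataclasses import dataclass
from typing import List, Optional
import random

@dataclass
class ScoreRecord:
    idea_id: str       # patent-grounded product idea
    evaluator_id: str  # which expert assigned the score
    dimension: str     # one of the six business dimensions
    score: int         # fine-grained ordinal score (1-5 assumed)

def format_history(records: List[ScoreRecord]) -> str:
    """Render past (idea, dimension, score) triples as in-context examples."""
    return "\n".join(f"Idea {r.idea_id} | {r.dimension}: {r.score}"
                     for r in records)

def build_prompt(idea_text: str, dimension: str,
                 history: Optional[List[ScoreRecord]] = None) -> str:
    rubric = f"Score the idea on {dimension} from 1 (poor) to 5 (excellent)."
    parts = [rubric]
    if history:  # the conditioning is what distinguishes the three setups
        parts.append("Past scoring decisions:\n" + format_history(history))
    parts.append(f"Idea:\n{idea_text}\nScore:")
    return "\n\n".join(parts)

idea = "A sensor-equipped insole that flags early gait anomalies."
all_records = [
    ScoreRecord("P001", "ev1", "market_size", 4),
    ScoreRecord("P002", "ev1", "market_size", 2),
    ScoreRecord("P001", "ev2", "market_size", 5),
    ScoreRecord("P003", "ev2", "market_size", 3),
]

# 1) Rubric-only zero-shot judge: no evaluator history at all.
zero_shot = build_prompt(idea, "market_size")

# 2) Aggregate judge: history pooled across mixed evaluators.
aggregate = build_prompt(idea, "market_size",
                         history=random.sample(all_records, k=3))

# 3) Personalized judge: history drawn only from the target evaluator.
target = [r for r in all_records if r.evaluator_id == "ev1"]
personalized = build_prompt(idea, "market_size", history=target)
```

The only difference between the aggregate and personalized conditions is whose history fills the context, which is what lets the comparison isolate the value of evaluator identity rather than in-context examples in general.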

Abstract

Evaluating LLM-generated business ideas is often harder to scale than generating them. Unlike standard NLP benchmarks, business idea evaluation relies on multi-dimensional criteria such as feasibility, novelty, differentiation, user need, and market size, and expert judgments often disagree. This paper studies a methodological question raised by such disagreement: should an automatic judge approximate an aggregate consensus, or model evaluators individually? We introduce PBIG-DATA, a dataset of approximately 3,000 individual scores across 300 patent-grounded product ideas, provided by domain experts on six business-oriented dimensions: specificity, technical validity, innovativeness, competitive advantage, need validity, and market size. Analyses show substantial expert disagreement on fine-grained ordinal scores, while agreement is higher under coarse selection, suggesting structured heterogeneity rather than random noise. We then compare three judge configurations: a rubric-only zero-shot judge, an aggregate judge conditioned on mixed evaluator histories, and a personalized judge conditioned on the target evaluator's scoring history. Across dimensions and model sizes, personalized judges align more closely with the corresponding evaluator than aggregate judges, and evaluator agreement correlates with similarity of judge-generated reasoning only under personalized conditioning. These results indicate that pooled labels can be a fragile target in pluralistic evaluation settings and motivate evaluator-conditioned judge designs for business idea assessment.
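A toy illustration, not the paper's analysis code, of the agreement pattern the abstract describes: the raw 1-5 scores and the "score >= 4" selection threshold below are invented, but they show how exact fine-grained agreement can be zero while coarse selection agreement is perfect.

```python
# Invented scores illustrating low fine-grained agreement alongside
# perfect coarse selection agreement; not the paper's data or code.
import numpy as np
from scipy.stats import spearmanr

# Two evaluators scoring the same 10 ideas on one dimension (1-5 scale).
ev_a = np.array([5, 4, 5, 2, 1, 5, 3, 2, 4, 1])
ev_b = np.array([4, 5, 4, 1, 2, 4, 2, 3, 5, 3])

# Fine-grained view: exact-match rate and rank correlation on raw scores.
exact_match = np.mean(ev_a == ev_b)            # 0.0 on this toy data
rho, _ = spearmanr(ev_a, ev_b)

# Coarse view: do both experts "select" the idea (assumed score >= 4)?
sel_a, sel_b = ev_a >= 4, ev_b >= 4
selection_agreement = np.mean(sel_a == sel_b)  # 1.0 on this toy data

print(f"exact match: {exact_match:.2f}, spearman rho: {rho:.2f}, "
      f"coarse selection agreement: {selection_agreement:.2f}")
```

Structured heterogeneity of this kind is exactly what makes a single pooled label a lossy target: averaging the two evaluators above discards systematic, individually consistent scoring behavior.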