Qworld: Question-Specific Evaluation Criteria for LLMs
arXiv cs.CL / 3/26/2026
Key Points
- The paper argues that evaluating LLM answers to open-ended questions requires context-dependent criteria, since simple binary scoring or static rubrics cannot capture question-specific requirements.
- It introduces Qworld (One-Question-One-World), which generates question-specific evaluation criteria via a recursive expansion tree that decomposes questions into scenarios, perspectives, and fine-grained binary criteria (see the sketch after this list).
- On HealthBench, Qworld reportedly covers 89% of expert-authored criteria, while 79% of the criteria it generates are novel and validated by human experts, showing higher insight and granularity than prior methods.
- Applying Qworld to 11 frontier LLMs across HealthBench and Humanity’s Last Exam shows that coarse rubrics miss capability differences along dimensions such as long-term impact, equity, error handling, and interdisciplinary reasoning.
- The core contribution is framing criteria generation as structured coverage of evaluation axes implied by each question, enabling adaptive evaluation rather than fixed task-level rubrics.
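To make the expansion-tree idea concrete, here is a minimal sketch in Python of how a question might be recursively decomposed into scenarios, perspectives, and binary criteria. The level names follow the summary above, but the `expand_tree`, `collect_criteria`, and `toy_expand` functions are illustrative assumptions, not the authors' implementation; in the real system, the expansion step would presumably be an LLM prompt rather than the stub used here.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Levels of the expansion tree, following the summary above:
# question -> scenarios -> perspectives -> fine-grained binary criteria.
LEVELS = ["question", "scenario", "perspective", "criterion"]


@dataclass
class Node:
    label: str
    kind: str                              # one of LEVELS
    children: List["Node"] = field(default_factory=list)


def expand_tree(label: str, kind: str,
                expand: Callable[[str, str], List[str]]) -> Node:
    """Recursively expand a node until binary criteria (leaves) are reached.

    `expand(label, next_kind)` is a stand-in for the LLM call that proposes
    child items at the next level of the tree.
    """
    node = Node(label, kind)
    level = LEVELS.index(kind)
    if level + 1 < len(LEVELS):
        next_kind = LEVELS[level + 1]
        for child_label in expand(label, next_kind):
            node.children.append(expand_tree(child_label, next_kind, expand))
    return node


def collect_criteria(node: Node) -> List[str]:
    """Flatten the tree into its leaf-level binary criteria."""
    if node.kind == "criterion":
        return [node.label]
    return [c for child in node.children for c in collect_criteria(child)]


if __name__ == "__main__":
    # Toy stand-in for the LLM expansion step, for illustration only.
    def toy_expand(label: str, next_kind: str) -> List[str]:
        return [f"{next_kind} A of '{label}'", f"{next_kind} B of '{label}'"]

    tree = expand_tree("How should mild hypertension be managed?",
                       "question", toy_expand)
    for criterion in collect_criteria(tree):
        print("[ ]", criterion)
```

The leaves form a question-specific checklist of yes/no criteria, which is what lets scoring adapt to each question instead of relying on a fixed task-level rubric.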