KWBench: Measuring Unprompted Problem Recognition in Knowledge Work
arXiv cs.AI · April 20, 2026
Key Points
- The paper introduces KWBench (Knowledge Work Bench), a new benchmark that tests whether large language models can recognize a professional scenario and its governing structure from raw inputs without being told the problem type.
- Unlike prior knowledge-work evaluations that focus on extraction or task completion against a specification, KWBench targets the “unprompted problem recognition” step by using formal game-theoretic patterns and expert-annotated failure modes.
- KWBench includes 223 practitioner-sourced tasks spanning domains such as acquisitions, contract negotiations, clinical pharmacy, organizational politics, fraud analysis, and incentive design; models receive only the raw data and a generic prompt (a sketch of this evaluation loop follows the list).
- Experiments on 16 models show low overall pass rates (the best model passes only 27.9% of tasks) and limited agreement among top models on which tasks they pass; even when models can name the correct game-theoretic concept, they often fail to apply it unprompted.
- The authors release the benchmark to reshape evaluation of frontier LLMs in knowledge work by scoring recognition of the right problem from the situation alone, not just execution after framing.
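To make the setup concrete, here is a minimal sketch of what a recognition-scoring loop of this kind could look like. Everything in it is an assumption for illustration: the field names (`raw_input`, `expected_pattern`, `failure_modes`), the generic prompt wording, and the naive keyword check, which stands in for whatever expert grading rubric the paper actually uses. This is not the authors' released harness.

```python
import json
from pathlib import Path

# Assumption: a generic, problem-type-free prompt, paraphrased from the
# paper's description; KWBench's exact wording is not reproduced here.
GENERIC_PROMPT = (
    "Below is raw material from a professional situation. "
    "Review it and advise the relevant party."
)

def evaluate_task(task: dict, model_respond) -> bool:
    """Score one task: the model sees only the raw inputs plus a
    generic prompt, never the problem type it should recognize."""
    response = model_respond(f"{GENERIC_PROMPT}\n\n{task['raw_input']}")
    # Naive keyword grading stands in for the paper's expert rubric:
    # pass iff the governing pattern is named and none of the
    # expert-annotated failure modes appears in the answer.
    recognized = task["expected_pattern"].lower() in response.lower()
    tripped = any(fm.lower() in response.lower()
                  for fm in task["failure_modes"])
    return recognized and not tripped

def pass_rate(task_dir: str, model_respond) -> float:
    """Aggregate pass rate over a directory of JSON task files."""
    tasks = [json.loads(p.read_text())
             for p in sorted(Path(task_dir).glob("*.json"))]
    return sum(evaluate_task(t, model_respond) for t in tasks) / len(tasks)
```

Even in this toy form, the benchmark's point is visible: the prompt never names the problem type, so credit depends entirely on the model surfacing the governing structure on its own.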