KWBench: Measuring Unprompted Problem Recognition in Knowledge Work

arXiv cs.AI / 4/20/2026

📰 NewsIdeas & Deep AnalysisModels & Research

共有:

Key Points

The paper introduces KWBench (Knowledge Work Bench), a new benchmark that tests whether large language models can recognize a professional scenario and its governing structure from raw inputs without being told the problem type.
Unlike prior knowledge-work evaluations that focus on extraction or task completion against a specification, KWBench targets the “unprompted problem recognition” step by using formal game-theoretic patterns and expert-annotated failure modes.
KWBench includes 223 practitioner-sourced tasks spanning domains such as acquisitions, contract negotiations, clinical pharmacy, organizational politics, fraud analysis, and incentive design, with models given only raw data and a generic prompt.
Experiments on 16 models show low overall pass rates (best model: 27.9%), limited agreement between top models on their passes, and the finding that even when models can name the correct game-theoretic concept, they often fail to apply it unprompted.
The authors release the benchmark to reshape evaluation of frontier LLMs in knowledge work by scoring recognition of the right problem from the situation alone, not just execution after framing.
categories: ["models-research", "ideas-deep-analysis"]

Abstract

We introduce the first version of KWBench (Knowledge Work Bench), a benchmark for unprompted problem recognition in large language models: can an LLM identify a professional scenario before attempting to solve it. Existing frontier benchmarks have saturated, and most knowledge-work evaluations to date reduce to extraction or task completion against a specification. KWBench targets the step before that: recognizing the governing structure of the situation from raw inputs alone. The benchmark contains 223 tasks sourced from practitioners across acquisitions, contract negotiations, clinical pharmacy, organizational politics, fraud analysis, and incentive design. Each task encodes a formal game-theoretic pattern (principal-agent conflict, signaling, mechanism design failure, strategic omission, coalitional dynamics, strategic interdependence) and carries structured ground truth recording the expert reading of the situation and the anticipated failure modes. Models receive raw data and a task prompt with no indication of problem type. Scoring is a three-tier rubric gated by a mandatory conjunctive check. Mandatory criteria encode the predicted wrong paths. We evaluate 16 models. The best model passes on 27.9% of tasks. The top two models agree on only 31.7% of their passes. Among the top 8, 44 tasks are solved by exactly one model; routing across the top 8 covers 50.7% of the benchmark, nearly double the best single model. Conditional on passing, quality scores converge (approx 83% across models); unconditional scores do not. Same models articulate the relevant game-theoretic concept correctly when asked, then fail to apply it unprompted. We release KWBench to shift how frontier models are evaluated on knowledge work, scoring them on whether they recognize the right problem from the situation alone, not only on how well they execute once the problem has been framed for them.

From Theory to Reality: Why Most AI Agent Projects Fail (And How Mine Did Too)

Dev.to

GPT-5.4-Cyber: OpenAI's Game-Changer for AI Security and Defensive AI

Dev.to

Building Digital Souls: The Brutal Reality of Creating AI That Understands You Like Nobody Else

Dev.to

Local LLM Beginner’s Guide (Mac - Apple Silicon)

Reddit r/artificial

Is Your Skill Actually Good? Systematically Validating Agent Skills with Evals

Dev.to

KWBench: Measuring Unprompted Problem Recognition in Knowledge Work

Key Points

Abstract

Related Articles

From Theory to Reality: Why Most AI Agent Projects Fail (And How Mine Did Too)

GPT-5.4-Cyber: OpenAI's Game-Changer for AI Security and Defensive AI

Building Digital Souls: The Brutal Reality of Creating AI That Understands You Like Nobody Else

Local LLM Beginner’s Guide (Mac - Apple Silicon)

Is Your Skill Actually Good? Systematically Validating Agent Skills with Evals

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer