OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models

arXiv cs.CL / 4/14/2026


Key Points

  • Proposes OccuBench, a benchmark targeting the realistic professional tasks across diverse occupational fields that existing benchmarks have lacked, covering 10 industry categories, 65 specialized domains, and 100 scenarios in total.
  • Language World Models (LWMs) let an LLM generate tool responses to simulate domain-specific environments, automatically synthesizing evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity.
  • Evaluation proceeds along two axes: (1) task completion across occupational domains and (2) environmental robustness under fault injection; implicit data degradation (missing or truncated fields) proves harder than explicit errors.
  • Comparing 15 frontier models across 8 model families shows that no single model dominates every industry, while larger model scale, newer generations, and greater reasoning effort consistently improve performance.
  • A strong agent is not necessarily a strong environment simulator; simulator quality is decisive for the reliability of LWM-based evaluation.
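The LWM idea above (an LLM stands in for the real backend, generating each tool's response from a grounding document and the interaction history) can be sketched minimally as follows. All names (`LWMEnvironment`, `call_tool`, the stubbed `_simulator_llm`) are illustrative assumptions, not interfaces from the paper; a real implementation would replace the stub with an actual LLM call.

```python
from dataclasses import dataclass, field

@dataclass
class LWMEnvironment:
    """Hypothetical sketch of an LWM-simulated tool environment."""
    domain_doc: str                          # document grounding the simulation
    history: list = field(default_factory=list)

    def _simulator_llm(self, prompt: str) -> str:
        # Stand-in for the simulator LLM. A real LWM would prompt a model
        # with the domain document, the tool schema, and the history so far.
        return f"[simulated response to: {prompt}]"

    def call_tool(self, tool_name: str, args: dict) -> str:
        # Instead of hitting a real API, ask the simulator LLM what the
        # tool would have returned in this domain.
        prompt = f"Tool {tool_name} called with {args} (domain: {self.domain_doc})"
        response = self._simulator_llm(prompt)
        self.history.append((tool_name, args, response))
        return response

env = LWMEnvironment(domain_doc="emergency department triage protocol")
reply = env.call_tool("lookup_patient", {"id": "A-17"})
```

This keeps the agent-facing tool interface unchanged, which is what makes the approach scale to domains without public environments.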

Abstract

AI agents are expected to perform professional work across hundreds of occupational domains (from emergency department triage to nuclear reactor safety monitoring to customs import processing), yet existing benchmarks can only evaluate agents in the few domains where public environments exist. We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language World Models (LWMs) that simulate domain-specific environments through LLM-driven tool response generation. Our multi-agent synthesis pipeline automatically produces evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity. OccuBench evaluates agents along two complementary dimensions: task completion across professional domains and environmental robustness under controlled fault injection (explicit errors, implicit data degradation, and mixed faults). We evaluate 15 frontier models across 8 model families and find that (1) no single model dominates all industries, as each has a distinct occupational capability profile; (2) implicit faults (truncated data, missing fields) are harder than both explicit errors (timeouts, 500s) and mixed faults, because they lack overt error signals and require the agent to independently detect data degradation; (3) larger models, newer generations, and higher reasoning effort consistently improve performance (GPT-5.2 gains 27.5 points from minimal to maximum reasoning effort); and (4) strong agents are not necessarily strong environment simulators, making simulator quality critical for the reliability of LWM-based evaluation. OccuBench provides the first systematic cross-industry evaluation of AI agents on professional occupational tasks.
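The three fault-injection modes from the abstract (explicit errors with overt signals, implicit degradation with none, and mixed faults) can be illustrated with a small sketch. The function name and fault choices below are assumptions for illustration, not the paper's actual injection harness.

```python
import random

def inject_fault(response: dict, mode: str, rng: random.Random) -> dict:
    """Hedged sketch of the three fault modes described in the abstract."""
    if mode == "explicit":
        # Overt error signal (timeout, HTTP 500) the agent can detect directly.
        return {"error": rng.choice(["timeout", "HTTP 500"])}
    if mode == "implicit":
        # Silent degradation: drop or truncate a field, with no error flag,
        # so the agent must notice the damage on its own.
        degraded = dict(response)
        key = rng.choice(list(degraded))
        if rng.random() < 0.5:
            degraded.pop(key)                        # missing field
        else:
            degraded[key] = str(degraded[key])[:3]   # truncated value
        return degraded
    if mode == "mixed":
        # Mixed faults draw from both mode types.
        return inject_fault(response, rng.choice(["explicit", "implicit"]), rng)
    return response
```

The abstract's finding that implicit faults are hardest matches the structure here: the explicit branch returns an unmistakable `error` key, while the implicit branch leaves a response that still looks well-formed.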