EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings
arXiv cs.AI / 3/17/2026
Key Points
- EnterpriseOps-Gym introduces a containerized sandbox with 164 database tables and 512 functional tools, designed to mimic the search friction of real enterprise systems when evaluating agentic planning.
- It comprises 1,150 expert-curated tasks across eight mission-critical verticals, including Customer Service, HR, and IT, which test long-horizon planning amid persistent state changes and strict access protocols.
- In an evaluation of 14 frontier models, Claude Opus 4.5 achieves only 37.4% success, revealing critical gaps in current enterprise-ready agent capabilities.
- The study finds that providing oracle human plans improves performance by 14-35 percentage points, pointing to strategic reasoning as the primary bottleneck. It also notes a high rate of infeasible task acceptance (best model: 53.9%), underscoring that current agents are not yet ready for autonomous enterprise deployment.
- The authors position EnterpriseOps-Gym as a concrete testbed to advance robustness of agentic planning in professional workflows.
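To make the ideas of persistent state and infeasible-task handling concrete, here is a minimal sketch of how an outcome-based success rate could be computed for stateful tasks. It is not the benchmark's actual scoring code: the `TaskResult` fields, the `is_success` rule, and the toy `tickets` data are all hypothetical names and assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskResult:
    """Hypothetical record of one agent run (not the benchmark's real schema)."""
    feasible: bool                          # can the task be done under the access rules?
    agent_refused: bool                     # did the agent decline to act?
    final_state: dict                       # database state after the run
    expected_state: Optional[dict] = None   # target state; None for infeasible tasks


def is_success(r: TaskResult) -> bool:
    # Assumption: an infeasible task counts as solved only if the agent refuses it.
    if not r.feasible:
        return r.agent_refused
    # Assumption: a feasible task is solved when the final state matches the target.
    return (not r.agent_refused) and r.final_state == r.expected_state


def success_rate(results: list) -> float:
    """Fraction of runs judged successful; 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(is_success(r) for r in results) / len(results)


closed = {"tickets": [{"id": 1, "status": "closed"}]}
still_open = {"tickets": [{"id": 1, "status": "open"}]}

results = [
    TaskResult(True, False, closed, closed),      # feasible, reached target state
    TaskResult(True, False, still_open, closed),  # feasible, wrong final state
    TaskResult(False, True, {}),                  # infeasible, correctly refused
    TaskResult(False, False, {}),                 # infeasible, wrongly accepted
]

print(f"success rate: {success_rate(results):.1%}")
```

Scoring on the final state rather than on the sequence of tool calls is one plausible design for long-horizon tasks, since many different call sequences can reach the same correct state.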