EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings

arXiv cs.AI / 3/17/2026

Key Points

EnterpriseOps-Gym introduces a containerized sandbox with 164 database tables and 512 functional tools, built to mimic real-world search friction when evaluating agentic planning.
It evaluates 1,150 expert-curated tasks across eight mission-critical verticals, including Customer Service, HR, and IT, to test long-horizon planning amid persistent state changes and strict access protocols.
Across the 14 frontier models evaluated, the top performer, Claude Opus 4.5, achieves only 37.4% success, revealing critical gaps in current enterprise-ready agent capabilities.
Providing oracle human plans improves performance by 14-35 percentage points, which points to strategic reasoning as the primary bottleneck.
Agents also frequently fail to refuse infeasible tasks (the best model achieves 53.9%), leading to unintended and potentially harmful side effects; the authors conclude that current agents are not yet ready for autonomous enterprise deployment.
The authors position EnterpriseOps-Gym as a concrete testbed to advance robustness of agentic planning in professional workflows.

Abstract

Large language models are shifting from passive information providers to active agents intended for complex workflows. However, their deployment as reliable AI workers in enterprise is stalled by benchmarks that fail to capture the intricacies of professional environments, specifically, the need for long-horizon planning amidst persistent state changes and strict access protocols. In this work, we introduce EnterpriseOps-Gym, a benchmark designed to evaluate agentic planning in realistic enterprise settings. Specifically, EnterpriseOps-Gym features a containerized sandbox with 164 database tables and 512 functional tools to mimic real-world search friction. Within this environment, agents are evaluated on 1,150 expert-curated tasks across eight mission-critical verticals (including Customer Service, HR, and IT). Our evaluation of 14 frontier models reveals critical limitations in state-of-the-art models: the top-performing Claude Opus 4.5 achieves only 37.4% success. Further analysis shows that providing oracle human plans improves performance by 14-35 percentage points, pinpointing strategic reasoning as the primary bottleneck. Additionally, agents frequently fail to refuse infeasible tasks (best model achieves 53.9%), leading to unintended and potentially harmful side effects. Our findings underscore that current agents are not yet ready for autonomous enterprise deployment. More broadly, EnterpriseOps-Gym provides a concrete testbed to advance the robustness of agentic planning in professional workflows.
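The abstract reports three kinds of numbers: a task success rate, a rate of correctly refusing infeasible tasks, and a gain in percentage points when oracle plans are supplied. The sketch below shows one way such figures could be computed from per-task outcomes. It is not the benchmark's actual scoring code: the `EpisodeResult` record, its field names, and the choice to score success over feasible tasks only are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EpisodeResult:
    """Outcome of one agent run on one task (hypothetical schema)."""
    task_id: str
    feasible: bool    # assumed flag: can the task be completed at all?
    completed: bool   # assumed flag: did the final state pass the task's checks?
    refused: bool     # assumed flag: did the agent decline to act?


def success_rate(results: List[EpisodeResult]) -> float:
    """Fraction of feasible tasks completed (assumed definition)."""
    feasible = [r for r in results if r.feasible]
    if not feasible:
        return 0.0
    return sum(r.completed for r in feasible) / len(feasible)


def refusal_rate(results: List[EpisodeResult]) -> float:
    """Fraction of infeasible tasks the agent correctly refused."""
    infeasible = [r for r in results if not r.feasible]
    if not infeasible:
        return 0.0
    return sum(r.refused for r in infeasible) / len(infeasible)


def percentage_point_gain(baseline: float, with_plan: float) -> float:
    """Absolute difference between two rates, in percentage points."""
    return (with_plan - baseline) * 100
```

Note that a percentage-point gain is an absolute difference between two rates, not a relative improvement: moving from a 25% to a 50% success rate is a gain of 25 percentage points, even though the rate itself has doubled.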
