Gym-Anything: Turn any Software into an Agent Environment

arXiv cs.LG / 4/8/2026


Key Points

  • Gym-Anything is presented as a framework that turns essentially any software into an interactive computer-use “agent environment,” aiming to remove the scalability bottleneck of manual environment creation.
  • The paper proposes a multi-agent environment-setup pipeline where a coding agent generates setup scripts, installs/configures the target software with real-world data, and produces evidence, while an independent audit agent verifies setup quality against a checklist.
  • Using an occupation taxonomy tied to U.S. GDP, the authors build CUA-World from 200 software applications, yielding 10K+ long-horizon tasks with realistic data plus train/test splits across domains like medical science, astronomy, engineering, and enterprise systems.
  • They also introduce CUA-World-Long, a tougher benchmark with tasks often exceeding 500 steps, and show that distilling successful trajectories into a 2B vision-language model yields performance surpassing models twice its size, while test-time trajectory review further boosts Gemini-3-Flash from 11.5% to 14.0%.
  • All code, infrastructure, and benchmark data are released to support future research on realistic long-horizon computer-use agents.
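The generate-then-audit setup pipeline above can be sketched as a simple retry loop: a coding agent produces a setup script plus evidence of correct configuration, and an independent audit agent checks that evidence against a checklist, sending failures back as feedback. The sketch below is a minimal illustration, not the paper's implementation; both agent functions are hypothetical stand-ins for the LLM agents the paper describes.

```python
# Minimal sketch of a generate-then-audit environment-setup loop.
# `coding_agent` and `audit_agent` are hypothetical stand-ins for the
# paper's LLM coding and audit agents.
from dataclasses import dataclass
from typing import Optional

# Example checklist items; the paper's actual quality checklist is richer.
CHECKLIST = ["software_installed", "real_data_loaded", "app_launches"]

@dataclass
class SetupResult:
    script: str     # generated setup script
    evidence: dict  # e.g. logs, screenshots, file listings per checklist item

def coding_agent(software: str, feedback: list) -> SetupResult:
    """Stand-in coding agent: writes a setup script and gathers evidence,
    addressing any checklist items flagged in prior audit feedback."""
    evidence = {item: True for item in CHECKLIST}  # faked evidence
    return SetupResult(script=f"# install and configure {software}",
                       evidence=evidence)

def audit_agent(result: SetupResult) -> list:
    """Stand-in audit agent: returns checklist items whose evidence
    is missing or unconvincing."""
    return [item for item in CHECKLIST if not result.evidence.get(item)]

def build_environment(software: str, max_rounds: int = 3) -> Optional[SetupResult]:
    """Retry setup until the audit passes or the round budget runs out."""
    feedback: list = []
    for _ in range(max_rounds):
        result = coding_agent(software, feedback)
        feedback = audit_agent(result)
        if not feedback:   # audit found no missing evidence: accept
            return result
    return None            # environment rejected after retries
```

The key design point the paper emphasizes is the independence of the two roles: the agent that builds the environment never grades its own work.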

Abstract

Computer-use agents hold the promise of assisting in a wide range of digital economic activities. However, current research has largely focused on short-horizon tasks over a limited set of software with limited economic value, such as basic e-commerce and OS-configuration tasks. A key reason is that creating environments for complex software requires significant time and human effort, and therefore does not scale. To address this, we introduce Gym-Anything, a framework for converting any software into an interactive computer-use environment. We frame environment creation itself as a multi-agent task: a coding agent writes setup scripts, downloads real-world data, and configures the software, while producing evidence of correct setup. An independent audit agent then verifies evidence for the environment setup against a quality checklist. Using a taxonomy of economically valuable occupations grounded in U.S. GDP data, we apply this pipeline to 200 software applications with broad occupational coverage. The result is CUA-World, a collection of over 10K long-horizon tasks spanning domains from medical science and astronomy to engineering and enterprise systems, each configured with realistic data along with train and test splits. CUA-World also includes CUA-World-Long, a challenging long-horizon benchmark with tasks often requiring over 500 steps, far exceeding existing benchmarks. Distilling successful trajectories from the training split into a 2B vision-language model outperforms models 2× its size. We also apply the same auditing principle at test time: a separate VLM reviews completed trajectories and provides feedback on what remains, improving Gemini-3-Flash on CUA-World-Long from 11.5% to 14.0%. We release all code, infrastructure, and benchmark data to facilitate future research in realistic computer-use agents.
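The abstract's test-time application of the auditing principle — a reviewer model inspects a finished trajectory and tells the agent what remains — can be sketched as a resume loop. This is an illustrative sketch only; `run_agent` and `review_trajectory` are hypothetical stand-ins for the computer-use agent and the reviewing VLM.

```python
# Hedged sketch of audit-at-test-time: after the agent finishes, a reviewer
# reports unfinished sub-goals and the agent resumes on that feedback.
# Both functions below are hypothetical stand-ins for model calls.

def run_agent(subtasks: list) -> list:
    """Stand-in policy: returns one action per subtask it was given."""
    return list(subtasks)

def review_trajectory(goal: list, trajectory: list) -> list:
    """Stand-in VLM reviewer: reports goal steps the trajectory never covered."""
    done = set(trajectory)
    return [step for step in goal if step not in done]

def solve_with_review(goal: list, max_reviews: int = 2) -> list:
    """Run the agent, then let the reviewer trigger resumption on what remains."""
    trajectory = run_agent(goal[:-1])          # simulate a first attempt that
                                               # misses the final step
    for _ in range(max_reviews):
        remaining = review_trajectory(goal, trajectory)
        if not remaining:                      # reviewer satisfied: stop
            break
        trajectory += run_agent(remaining)     # resume on reviewer feedback
    return trajectory
```

On long-horizon tasks with hundreds of steps, this kind of external check plausibly matters most near the end, where an agent may stop early believing the task is done; the reported Gemini-3-Flash gain (11.5% to 14.0%) is consistent with that reading.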