ClawGym: A Scalable Framework for Building Effective Claw Agents

arXiv cs.CL / 4/30/2026


Key Points

  • The paper introduces ClawGym, a scalable end-to-end framework for developing “Claw-style” personal agents that operate over local files, tools, and persistent workspace state.
  • It also releases ClawGym-SynData, a dataset of 13.5K filtered tasks generated from persona-driven intents and skill-grounded operations, each paired with a realistic mock workspace and hybrid verification (see the sketch after this list).
  • Using this data, the authors train a suite of ClawGym-Agents via supervised fine-tuning on black-box rollout trajectories, then further explore reinforcement learning through a lightweight pipeline that parallelizes rollouts across per-task sandboxes (sketched after the abstract).
  • For evaluation, the work proposes ClawGym-Bench with 200 benchmark instances created via automated filtering plus human–LLM review.
  • The authors plan to release these resources soon on GitHub, aiming to support reproducible training-data synthesis and diagnostic evaluation for such agents.
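
For concreteness, here is a minimal, hypothetical sketch of the synthesis-and-verification idea the paper describes. Every name below (Persona, Skill, synthesize_task, hybrid_verify) is illustrative rather than taken from the ClawGym release, and the "hybrid" check is assumed here to combine a deterministic rule check with an LLM judge:

```python
# Hypothetical sketch of persona-driven, skill-grounded task synthesis with
# hybrid verification; names and interfaces are illustrative only.
from dataclasses import dataclass

@dataclass
class Persona:
    role: str    # e.g. "freelance accountant"
    intent: str  # e.g. "consolidate this month's invoices"

@dataclass
class Skill:
    name: str       # e.g. "spreadsheet_merge"
    ops: list[str]  # workspace operations the skill grounds

def synthesize_task(persona: Persona, skill: Skill) -> dict:
    """Pair a persona-driven intent with skill-grounded operations and
    a mock workspace for the agent to act on."""
    return {
        "instruction": f"As a {persona.role}, {persona.intent} using {skill.name}.",
        "workspace": {"files": [f"{op}.csv" for op in skill.ops]},  # mock state
    }

def hybrid_verify(task: dict, final_workspace: dict) -> bool:
    """Hybrid verification: a deterministic rule check over workspace state,
    combined with an LLM judge for open-ended criteria (stubbed here)."""
    rule_ok = all(
        f in final_workspace.get("files", [])
        for f in task["workspace"]["files"]
    )
    llm_ok = True  # placeholder for an LLM-judge call
    return rule_ok and llm_ok
```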

Abstract

Claw-style environments support multi-step workflows over local files, tools, and persistent workspace states. However, scalable development around these environments remains constrained by the absence of a systematic framework, especially one for synthesizing verifiable training data and integrating it with agent training and diagnostic evaluation. To address this challenge, we present ClawGym, a scalable framework that supports the full lifecycle of Claw-style personal agent development. Concretely, we construct ClawGym-SynData, a diverse dataset of 13.5K filtered tasks synthesized from persona-driven intents and skill-grounded operations, paired with realistic mock workspaces and hybrid verification mechanisms. We then train a family of capable Claw-style models, termed ClawGym-Agents, through supervised fine-tuning on black-box rollout trajectories, and further explore reinforcement learning via a lightweight pipeline that parallelizes rollouts across per-task sandboxes. To support reliable evaluation, we further construct ClawGym-Bench, a benchmark of 200 instances calibrated through automated filtering and human–LLM review. Relevant resources will soon be released at https://github.com/ClawGym.
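
The abstract's RL stage hinges on parallelizing rollouts across per-task sandboxes. The sketch below shows one minimal way that could look; run_episode, its reward stub, and the worker count are all assumptions, since the paper's actual interfaces are not yet public:

```python
# A minimal sketch of parallelized rollouts across per-task sandboxes;
# run_episode and its constant reward are hypothetical stand-ins.
import tempfile
from concurrent.futures import ProcessPoolExecutor

def run_episode(task_id: int) -> float:
    """Roll out one task inside its own throwaway sandbox directory and
    return a scalar reward for the RL update."""
    with tempfile.TemporaryDirectory(prefix=f"task-{task_id}-") as sandbox:
        # The agent would read/write files under `sandbox` here, and the
        # task's verifier would score the final state; stubbed as 1.0.
        return 1.0

if __name__ == "__main__":
    # Each task gets an isolated sandbox, so rollouts can run in parallel
    # without interfering with one another's workspace state.
    with ProcessPoolExecutor(max_workers=8) as pool:
        rewards = list(pool.map(run_episode, range(16)))
    print(f"mean reward: {sum(rewards) / len(rewards):.2f}")
```

Per-task isolation is the design point worth noting: because no two rollouts share a workspace, the pipeline can scale out with ordinary process-level parallelism rather than any environment-level locking.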