Automatic Generation of High-Performance RL Environments

arXiv cs.LG / 3/13/2026



Key Points

The article proposes a reusable recipe that combines a generic prompt template, hierarchical verification, and iterative agent-assisted repair to generate semantically equivalent, high-performance RL environments for under $10 in compute cost.
It demonstrates three workflows across five environments. In the direct-translation workflow, EmuRust (a Game Boy emulator) achieves a 1.5x PPO speedup via Rust parallelism, and PokeJAX, the first GPU-parallel Pokemon battle simulator, reaches 500M SPS with random actions and 15.2M SPS under PPO.
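Against existing performance implementations, the translations reach throughput parity with MJX (1.04x), 5x over Brax at matched GPU batch sizes (HalfCheetah JAX), and 42x PPO on Puffer Pong. The work also introduces TCGJax, the first deployable JAX Pokemon TCG engine (717K SPS random action, 153K SPS PPO); at 200M parameters, environment overhead drops below 4% of training time.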
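Hierarchical verification (property, interaction, and rollout tests) confirms semantic equivalence for all five environments, and cross-backend policy transfer confirms zero sim-to-sim gap. TCGJax, synthesized from a private reference absent from public repositories, serves as a contamination control for concerns about agent pretraining data.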
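
The throughput figures above are given in SPS, presumably environment steps per wall-clock second summed over all parallel instances. As a rough illustration only (this is not the paper's benchmarking harness), the sketch below shows one way such a number, and an environment-overhead fraction, could be computed. The function names `measure_sps` and `env_overhead_fraction` and the toy batched step are hypothetical.

```python
import time


def measure_sps(step_batch, batch_size, num_calls):
    """Estimate steps per second for a batched step function.

    step_batch: zero-argument callable that advances `batch_size`
                environment instances by one step each (hypothetical).
    """
    start = time.perf_counter()
    for _ in range(num_calls):
        step_batch()
    # Guard against a zero reading on very coarse clocks.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return (batch_size * num_calls) / elapsed


def env_overhead_fraction(env_seconds, learner_seconds):
    """Share of total training time spent stepping the environment."""
    return env_seconds / (env_seconds + learner_seconds)


# Toy stand-in for a batched environment: 1024 counters advanced per call.
counters = [0] * 1024


def toy_step_batch():
    for i in range(len(counters)):
        counters[i] += 1


sps = measure_sps(toy_step_batch, batch_size=len(counters), num_calls=50)
print(f"toy SPS: {sps:,.0f}")
print(f"overhead: {env_overhead_fraction(4.0, 96.0):.0%}")
```

Real measurements on accelerators would additionally need warm-up calls and explicit device synchronization before reading the clock; both are omitted here for brevity.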

Abstract

Translating complex reinforcement learning (RL) environments into high-performance implementations has traditionally required months of specialized engineering. We present a reusable recipe - a generic prompt template, hierarchical verification, and iterative agent-assisted repair - that produces semantically equivalent high-performance environments for <$10 in compute cost. We demonstrate three distinct workflows across five environments. Direct translation (no prior performance implementation exists): EmuRust (1.5x PPO speedup via Rust parallelism for a Game Boy emulator) and PokeJAX, the first GPU-parallel Pokemon battle simulator (500M SPS random action, 15.2M SPS PPO; 22,320x over the TypeScript reference). Translation verified against existing performance implementations: throughput parity with MJX (1.04x) and 5x over Brax at matched GPU batch sizes (HalfCheetah JAX); 42x PPO (Puffer Pong). New environment creation: TCGJax, the first deployable JAX Pokemon TCG engine (717K SPS random action, 153K SPS PPO; 6.6x over the Python reference), synthesized from a web-extracted specification. At 200M parameters, the environment overhead drops below 4% of training time. Hierarchical verification (property, interaction, and rollout tests) confirms semantic equivalence for all five environments; cross-backend policy transfer confirms zero sim-to-sim gap for all five environments. TCGJax, synthesized from a private reference absent from public repositories, serves as a contamination control for agent pretraining data concerns. The paper contains sufficient detail - including representative prompts, verification methodology, and complete results - that a coding agent could reproduce the translations directly from the manuscript.
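
The abstract names three verification levels (property, interaction, and rollout tests) without giving code. The following is a minimal sketch of how such a hierarchy might be organized, assuming a deterministic toy environment. The environment, the `State` type, and every function name here are hypothetical illustrations, not the paper's actual test suite.

```python
import random
from dataclasses import dataclass

WIDTH = 8  # toy 1-D track; reaching the right edge scores a point


@dataclass(frozen=True)
class State:
    pos: int
    score: int
    done: bool


def reference_step(s, action):
    """Readable reference: action 0 = left, 1 = stay, 2 = right."""
    if s.done:
        return s
    pos = s.pos
    if action == 0:
        pos = max(0, pos - 1)
    elif action == 2:
        pos = min(WIDTH - 1, pos + 1)
    score = s.score + (1 if pos == WIDTH - 1 else 0)
    return State(pos, score, score >= 3)


def fast_step(s, action):
    """Compact 'translation' whose equivalence we want to check."""
    if s.done:
        return s
    pos = min(WIDTH - 1, max(0, s.pos + action - 1))
    score = s.score + (pos == WIDTH - 1)
    return State(pos, score, score >= 3)


def buggy_step(s, action):
    """Deliberately wrong translation: off-by-one in the upper clamp."""
    if s.done:
        return s
    pos = min(WIDTH, max(0, s.pos + action - 1))
    score = s.score + (pos == WIDTH - 1)
    return State(pos, score, score >= 3)


def property_test(step):
    """Level 1: invariants that need no reference (position stays in bounds)."""
    return all(0 <= step(State(p, 0, False), a).pos < WIDTH
               for p in range(WIDTH) for a in range(3))


def interaction_test(ref, new):
    """Level 2: single-step agreement on every (state, action) pair."""
    return all(ref(State(p, sc, False), a) == new(State(p, sc, False), a)
               for p in range(WIDTH) for sc in range(3) for a in range(3))


def rollout_test(ref, new, episodes=100, horizon=50, seed=0):
    """Level 3: full-trajectory agreement under seeded random actions."""
    rng = random.Random(seed)
    for _ in range(episodes):
        s_ref = s_new = State(rng.randrange(WIDTH), 0, False)
        for _ in range(horizon):
            a = rng.randrange(3)
            s_ref, s_new = ref(s_ref, a), new(s_new, a)
            if s_ref != s_new:
                return False
    return True


def verify(ref, new):
    """Run the cheapest level first and report the first level that fails."""
    if not property_test(new):
        return "property"
    if not interaction_test(ref, new):
        return "interaction"
    if not rollout_test(ref, new):
        return "rollout"
    return "ok"


print(verify(reference_step, fast_step))
print(verify(reference_step, buggy_step))
```

Ordering the levels from cheapest to most expensive means a failure is reported at the narrowest level that exposes it, which is the kind of localized signal an iterative repair loop could act on.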