Healthcare AI GYM for Medical Agents

arXiv cs.LG / 5/6/2026

Key Points

  • The paper introduces a unified, Gymnasium-compatible RL training environment for medical AI agents, covering 10 clinical domains, 3.6K+ tasks, 135 clinical tools, and an 828K-passage knowledge base (see the interaction-loop sketch after this list).
  • It finds that naïve multi-turn agentic RL can collapse into overly long, verbose single-turn “monologues,” marked by monotonic response-length growth and reduced tool usage.
  • The authors attribute the collapse and related distillation instability to misalignment between sparse terminal rewards and the sequential nature of clinical trajectories.
  • While vanilla GRPO can reach strong benchmark accuracy in some cases, it shows training instability, including oscillations in response length and long convergence times.
  • To address these issues, the authors propose Turn-level Truncated On-Policy Distillation (TT-OPD), which uses a gradient-free EMA teacher with outcome-privileged information to add dense, turn-by-turn KL regularization. TT-OPD achieves the best results on 10 of 18 benchmarks, with an average +3.9 percentage-point improvement over the non-RL baseline, along with faster early convergence, controlled response length, and sustained multi-turn tool use.
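
For readers less familiar with the Gymnasium API, the sketch below illustrates the kind of episode loop such an environment supports. The environment ID, the contents of observations, and the random action choice are illustrative assumptions rather than the paper's actual interface; the real environment exposes the clinical tools and tasks listed above.

```python
import gymnasium as gym

# Hypothetical episode loop against a Gymnasium-compatible clinical
# environment. "HealthcareGym-v0" is an assumed ID, not the paper's.
env = gym.make("HealthcareGym-v0")
obs, info = env.reset(seed=0)  # e.g., patient presentation + task prompt

terminated = truncated = False
total_reward = 0.0
while not (terminated or truncated):
    # A trained policy would choose a tool call (order a test, query the
    # knowledge base, ...) or a final answer; we sample randomly here.
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward  # sparse: typically nonzero only at episode end

print("episode return:", total_reward)
```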

Abstract

Clinical reasoning demands multi-step interactions -- gathering patient history, ordering tests, interpreting results, and making safe treatment decisions -- yet a unified training environment that provides the breadth of clinical domains and specialized tools needed to train generalizable medical AI agents through reinforcement learning remains elusive. We present a comprehensive empirical study of multi-turn agentic RL for medical AI, built on Healthcare AI GYM, a Gymnasium-compatible environment spanning 10 clinical domains with 3.6K+ tasks, 135 domain-specific tools, and a knowledge base of 828K medical passages. Our analysis reveals that the agentic multi-turn structure degrades into verbose single-turn monologues, characterized by monotonic length explosion and a simultaneous erosion of tool-use frequency. We characterize how this collapse, alongside distillation instability, stems from the misalignment of sparse terminal rewards with sequential clinical trajectories. We find that vanilla GRPO achieves strong final accuracy on some benchmarks but suffers from training instability, evidenced by significant oscillations in response length and prolonged convergence periods. To improve training efficiency and stability, we propose Turn-level Truncated On-Policy Distillation (TT-OPD), a self-distillation framework in which a gradient-free EMA teacher leverages outcome-privileged information to provide dense, outcome-aware KL regularization at every conversation turn. TT-OPD achieves the best performance on 10 of 18 benchmarks, with an average +3.9 pp improvement over the non-RL baseline, faster early convergence, controlled response length, and sustained multi-turn tool use.
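
To make the distillation mechanism more concrete, here is a minimal PyTorch sketch of the two ingredients TT-OPD combines: a gradient-free EMA teacher and a dense per-turn KL penalty. The function names, the decay and beta values, and the loss shape are our own simplifying assumptions; the paper's truncation rule and the teacher's outcome-privileged conditioning are not reproduced here.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, decay=0.99):
    # Gradient-free teacher: an exponential moving average of the student's
    # weights, refreshed after each optimizer step.
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(decay).add_(s_p, alpha=1.0 - decay)

def turn_kl_penalty(student_logits, teacher_logits, beta=0.1):
    # Dense turn-level signal: KL(teacher || student) over next-token
    # distributions for the tokens of one conversation turn. In training,
    # a term like this would be added to a GRPO-style policy loss at every
    # turn, rather than relying on the sparse terminal reward alone.
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits.detach(), dim=-1)
    kl = F.kl_div(s_logp, t_logp, log_target=True, reduction="batchmean")
    return beta * kl
```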