Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning

arXiv cs.AI, April 14, 2026

Key Points

  • The paper introduces COVERT, a two-stage synthetic data pipeline aimed at producing tool-use trajectories that are compatible with reinforcement learning by enabling reward-checkable online rollouts.
  • COVERT first generates base trajectories via self-evolving synthesis with multi-level validation, ensuring reliability before RL training.
  • It then performs oracle-preserving augmentations that raise task difficulty (e.g., distractor tools, ambiguous queries, noisy or erroneous tool outputs) while strictly keeping the oracle tool calls and final answers as ground truth.
  • The approach supports automatic reward computation via reference matching for standard cases and uses lightweight judge-assisted verification for special behaviors like error detection.
  • Experiments on Qwen2.5-Instruct-14B show improved tool-use accuracy on BFCL v3 (56.5→59.9) and ACEBench (53.0→59.3), with additional gains when stacked on SFT and minimal regressions on general-ability benchmarks.
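The reference-matching reward described above can be sketched as a simple exact-match check over tool-call sequences. This is a minimal illustrative sketch, not the paper's implementation; the function name, the `{"name": ..., "arguments": ...}` call format, and the binary 0/1 reward are assumptions for illustration.

```python
import json

def reference_match_reward(predicted_calls, oracle_calls):
    """Hypothetical binary reward: 1.0 if the predicted tool-call sequence
    exactly matches the oracle sequence (same tool names and arguments),
    else 0.0. Argument dicts are compared order-insensitively via
    canonical (key-sorted) JSON serialization."""
    if len(predicted_calls) != len(oracle_calls):
        return 0.0
    for pred, ref in zip(predicted_calls, oracle_calls):
        if pred["name"] != ref["name"]:
            return 0.0
        if json.dumps(pred["arguments"], sort_keys=True) != \
           json.dumps(ref["arguments"], sort_keys=True):
            return 0.0
    return 1.0
```

A real pipeline would likely add normalization (e.g., for equivalent argument formats) and fall back to the judge-assisted check for special behaviors like error detection, where no exact reference exists.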

Abstract

Existing synthetic tool-use corpora are primarily designed for offline supervised fine-tuning, yet reinforcement learning (RL) requires executable environments that support reward-checkable online rollouts. We propose COVERT, a two-stage pipeline that first generates reliable base tool-use trajectories through self-evolving synthesis with multi-level validation, and then applies oracle-preserving augmentations that systematically increase environmental complexity. These augmentations introduce distractor tools, indirect or ambiguous user queries, and noisy, multi-format, or erroneous tool outputs, while strictly preserving oracle tool calls and final answers as ground truth. This design enables automatic reward computation via reference matching for standard cases and lightweight judge-assisted verification for special behaviors such as error detection, supporting RL optimization of tool-calling policies. On Qwen2.5-Instruct-14B, COVERT-RL improves overall accuracy on BFCL v3 from 56.5 to 59.9 and on ACEBench from 53.0 to 59.3, with minimal regressions on general-ability benchmarks; when stacked on SFT, it further reaches 62.1 and 61.8, confirming additive gains. These results suggest that oracle-preserving synthetic environments offer a practical RL refinement stage, complementary to SFT, for improving tool-use robustness under ambiguity and unreliable tool feedback.
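One of the oracle-preserving augmentations from the abstract, adding distractor tools, can be sketched as follows. This is a hedged illustration under assumed data structures (a task dict with `tools`, `oracle_calls`, and `answer` fields), not the authors' code: the key point it demonstrates is that the augmentation only enlarges the tool list and never touches the oracle calls or final answer, so reference-based rewards stay valid.

```python
import random

def add_distractor_tools(task, distractor_pool, k=3, seed=0):
    """Hypothetical oracle-preserving augmentation: enlarge the task's
    tool list with up to k distractor tools that the oracle never calls.
    The oracle tool calls and final answer are left untouched, so
    reference-matching reward computation remains well-defined."""
    rng = random.Random(seed)
    # Never add a distractor that shares a name with an oracle-called tool.
    used = {call["name"] for call in task["oracle_calls"]}
    candidates = [t for t in distractor_pool if t["name"] not in used]
    augmented = dict(task)  # shallow copy; oracle fields are shared, not modified
    augmented["tools"] = task["tools"] + rng.sample(candidates,
                                                    min(k, len(candidates)))
    return augmented
```

The other augmentation families (ambiguous query rewrites, noisy or erroneous tool outputs) would follow the same pattern: perturb the environment or inputs while holding the oracle trajectory fixed as ground truth.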