Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning

arXiv cs.AI, April 14, 2026

Key Points

  • The paper introduces COVERT, a two-stage synthetic data pipeline aimed at producing tool-use trajectories that are compatible with reinforcement learning by enabling reward-checkable online rollouts.
  • COVERT first generates base trajectories via self-evolving synthesis with multi-level validation, ensuring reliability before RL training.
  • It then performs oracle-preserving augmentations that raise task difficulty (e.g., distractor tools, ambiguous queries, noisy or erroneous tool outputs) while strictly keeping the oracle tool calls and final answers as ground truth.
  • The approach supports automatic reward computation via reference matching for standard cases and uses lightweight judge-assisted verification for special behaviors like error detection.
  • Experiments on Qwen2.5-Instruct-14B show improved tool-use accuracy on BFCL v3 (56.5→59.9) and ACEBench (53.0→59.3), with additional gains when stacked on SFT and minimal regressions on general-ability benchmarks.
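The reference-matching reward described above can be sketched as a simple exact-match check over tool-call sequences. This is a minimal illustrative sketch, not the paper's implementation; the function name, the `{"name": ..., "arguments": ...}` call format, and the binary 0/1 reward are assumptions for illustration.

```python
import json

def reference_match_reward(predicted_calls, oracle_calls):
    """Hypothetical binary reward: 1.0 if the predicted tool-call sequence
    exactly matches the oracle sequence (same tool names and arguments),
    else 0.0. Argument dicts are compared order-insensitively via
    canonical (key-sorted) JSON serialization."""
    if len(predicted_calls) != len(oracle_calls):
        return 0.0
    for pred, ref in zip(predicted_calls, oracle_calls):
        if pred["name"] != ref["name"]:
            return 0.0
        if json.dumps(pred["arguments"], sort_keys=True) != \
           json.dumps(ref["arguments"], sort_keys=True):
            return 0.0
    return 1.0
```

A real pipeline would likely add normalization (e.g., for equivalent argument formats) and fall back to the judge-assisted check for special behaviors like error detection, where no exact reference exists.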

Abstract

Existing synthetic tool-use corpora are primarily designed for offline supervised fine-tuning, yet reinforcement learning (RL) requires executable environments that support reward-checkable online rollouts. We propose COVERT, a two-stage pipeline that first generates reliable base tool-use trajectories through self-evolving synthesis with multi-level validation, and then applies oracle-preserving augmentations that systematically increase environmental complexity. These augmentations introduce distractor tools, indirect or ambiguous user queries, and noisy, multi-format, or erroneous tool outputs, while strictly preserving oracle tool calls and final answers as ground truth. This design enables automatic reward computation via reference matching for standard cases and lightweight judge-assisted verification for special behaviors such as error detection, supporting RL optimization of tool-calling policies. On Qwen2.5-Instruct-14B, COVERT-RL improves overall accuracy on BFCL v3 from 56.5 to 59.9 and on ACEBench from 53.0 to 59.3, with minimal regressions on general-ability benchmarks; when stacked on SFT, it further reaches 62.1 and 61.8, confirming additive gains. These results suggest that oracle-preserving synthetic environments offer a practical RL refinement stage, complementary to SFT, for improving tool-use robustness under ambiguity and unreliable tool feedback.
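One of the oracle-preserving augmentations from the abstract, adding distractor tools, can be sketched as follows. This is a hedged illustration under assumed data structures (a task dict with `tools`, `oracle_calls`, and `answer` fields), not the authors' code: the key point it demonstrates is that the augmentation only enlarges the tool list and never touches the oracle calls or final answer, so reference-based rewards stay valid.

```python
import random

def add_distractor_tools(task, distractor_pool, k=3, seed=0):
    """Hypothetical oracle-preserving augmentation: enlarge the task's
    tool list with up to k distractor tools that the oracle never calls.
    The oracle tool calls and final answer are left untouched, so
    reference-matching reward computation remains well-defined."""
    rng = random.Random(seed)
    # Never add a distractor that shares a name with an oracle-called tool.
    used = {call["name"] for call in task["oracle_calls"]}
    candidates = [t for t in distractor_pool if t["name"] not in used]
    augmented = dict(task)  # shallow copy; oracle fields are shared, not modified
    augmented["tools"] = task["tools"] + rng.sample(candidates,
                                                    min(k, len(candidates)))
    return augmented
```

The other augmentation families (ambiguous query rewrites, noisy or erroneous tool outputs) would follow the same pattern: perturb the environment or inputs while holding the oracle trajectory fixed as ground truth.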