Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards

arXiv cs.LG / March 27, 2026


Key Points

  • The paper tackles the difficulty of training LLMs to perform multi-step tool orchestration, where outputs from one API call must correctly feed into dependent subsequent calls.
  • It introduces an RL training framework using a large cache of real API responses to generate valid, controllably complex multi-step traces with substantially better efficiency than unconstrained synthesis.
  • It proposes a graduated reward scheme that provides learning signal for both atomic validity (correctness of individual function calls at increasing granularity) and orchestration correctness (proper tool sequencing that respects dependencies).
  • Experiments on ComplexFuncBench show substantial gains in turn accuracy, and ablation studies indicate that both reward components are required for best performance.
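The graduated reward described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the weights, helper names (`atomic_reward`, `orchestration_reward`), and the exact granularity of partial credit are all hypothetical, chosen only to show how credit can be split between per-call validity and dependency-respecting ordering.

```python
from dataclasses import dataclass

@dataclass
class Call:
    name: str
    args: dict
    depends_on: list  # indices of earlier calls whose outputs this call consumes

def atomic_reward(pred: Call, gold: Call) -> float:
    """Graduated credit for one call: function name, then parameter names,
    then parameter values. Weights are illustrative, not from the paper."""
    if pred.name != gold.name:
        return 0.0
    score = 0.4  # correct function selected
    keys = set(gold.args)
    if keys:
        shared = set(pred.args) & keys
        score += 0.3 * len(shared) / len(keys)  # right parameter names
        correct = sum(pred.args.get(k) == v for k, v in gold.args.items())
        score += 0.3 * correct / len(keys)      # right parameter values
    else:
        score += 0.6
    return score

def orchestration_reward(pred: list, gold: list) -> float:
    """Fraction of gold dependency edges whose ordering the prediction preserves."""
    pos = {c.name: i for i, c in enumerate(pred)}
    edges = [(gold[j].name, c.name) for c in gold for j in c.depends_on]
    if not edges:
        return 1.0
    ok = sum(s in pos and d in pos and pos[s] < pos[d] for s, d in edges)
    return ok / len(edges)

def graduated_reward(pred, gold, w_atomic=0.5, w_orch=0.5):
    """Combine per-call (atomic) and sequencing (orchestration) signal."""
    atomic = sum(atomic_reward(p, g) for p, g in zip(pred, gold)) / max(len(gold), 1)
    return w_atomic * atomic + w_orch * orchestration_reward(pred, gold)
```

A trace with one wrong parameter value still earns most of the reward (name and parameter keys match, ordering is preserved), which is exactly the partial-correctness signal a binary pass/fail reward would discard.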

Abstract

Multi-step tool orchestration, where LLMs must invoke multiple dependent APIs in the correct order while propagating intermediate outputs, remains challenging. State-of-the-art models frequently fail on full-sequence execution, with parameter-value errors accounting for a significant portion of failures. Training models to handle such workflows faces two obstacles: existing environments focus on simple per-turn function calls with simulated data, and binary rewards provide no signal for partial correctness. We present a framework that addresses both challenges. First, we construct a reinforcement learning environment backed by a large-scale cache of real API responses, enabling a data synthesis pipeline that samples valid multi-step orchestration traces with controllable complexity and significantly higher generation efficiency than unconstrained methods. Second, we propose a graduated reward design that decomposes correctness into atomic validity (individual function-call correctness at increasing granularity) and orchestration (correct tool sequencing that respects dependencies). On ComplexFuncBench, our approach demonstrates substantial improvements in turn accuracy. Ablation studies confirm that both reward components are essential: using either alone significantly degrades performance.