PORTool: Importance-Aware Policy Optimization with Rewarded Tree for Multi-Tool-Integrated Reasoning

arXiv cs.CL / 5/4/2026

Key Points

  • PORTool is an importance-aware policy-optimization method for multi-tool-integrated reasoning agents that addresses the credit-assignment ambiguity arising when agents are trained with outcome-only rewards.
  • The approach builds a rewarded rollout tree where trajectories share prefixes, allowing step-level comparisons of different tool-use decisions within the same context.
  • PORTool estimates the importance of each step using a correctness-dominant signal (whether descendants can reach a correct final answer) plus an auxiliary term for tool-call formatting and successful execution.
  • Experiments on tool-use reasoning tasks show higher final-answer accuracy and fewer tool-call steps than state-of-the-art policy-optimization baselines, and ablations support the robustness of the step-wise importance estimates.
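The rewarded rollout tree and the step-importance estimate described in the key points can be illustrated with a small sketch. It assumes a hypothetical `StepNode` structure whose children are alternative continuations sharing the node's prefix, a binary correctness term ("does any descendant trajectory reach a correct answer?"), and an auxiliary term that averages formatting and execution success, scaled by a hypothetical weight `aux_weight`. The paper's exact formula and weighting are not given here, so this is an illustration of the idea, not PORTool's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StepNode:
    """One tool-call step in a rewarded rollout tree (hypothetical structure).

    Children are alternative continuations that share this node's prefix.
    Leaves are terminal steps carrying the outcome-only reward.
    """
    format_ok: bool = True           # tool call met the formatting constraints
    exec_ok: bool = True             # tool call executed without error
    correct: Optional[bool] = None   # set only on leaves: was the final answer correct?
    children: List["StepNode"] = field(default_factory=list)


def leaf_outcomes(node: StepNode) -> List[bool]:
    """Collect the final-answer correctness of every trajectory below `node`."""
    if not node.children:
        return [bool(node.correct)]
    outcomes: List[bool] = []
    for child in node.children:
        outcomes.extend(leaf_outcomes(child))
    return outcomes


def step_importance(node: StepNode, aux_weight: float = 0.1) -> float:
    """Assumed importance score: correctness term plus a small auxiliary term.

    Correctness is 1.0 if any descendant trajectory reaches a correct answer.
    The auxiliary term lies in [0, 1], so with aux_weight < 1 it can never
    outweigh a difference in the correctness term (correctness-dominant).
    """
    correctness = 1.0 if any(leaf_outcomes(node)) else 0.0
    auxiliary = 0.5 * (float(node.format_ok) + float(node.exec_ok))
    return correctness + aux_weight * auxiliary
```

For example, with the default weight, a step whose subtree contains at least one correct trajectory scores about 1.1, while a sibling whose tool call failed to execute and whose descendants are all wrong scores 0.05. Keeping `aux_weight` below 1 is what makes correctness dominate: no amount of formatting or execution success can lift a step that never leads to a correct answer above one that does.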

Abstract

Multi-tool-integrated reasoning enables LLM-empowered tool-use agents to solve complex tasks by interleaving natural-language reasoning with calls to external tools. However, training such agents from outcome-only rewards suffers from credit-assignment ambiguity, obscuring which intermediate tool-use decisions drive success or failure. In this paper, we propose PORTool, an importance-aware policy-optimization algorithm that reinforces agents' tool-use competence from outcome-level supervision while assigning reward at the step level. Specifically, PORTool generates a rewarded rollout tree in which trajectories share prefixes before branching, enabling direct comparisons among alternative tool-use decisions within the same context. It then estimates each step's importance by a correctness-dominant signal, i.e., whether descendants of that step can ultimately produce a correct final answer, plus an auxiliary term indicating whether the step's tool calls satisfy formatting constraints and execute successfully. Using these step-wise importance estimates, PORTool updates the policy to generate efficient tool-call steps, guided by both local comparisons within each branching decision and the overall quality of entire trajectories. Experiments show that PORTool improves final-answer accuracy while reducing tool-call steps compared with state-of-the-art policy-optimization baselines, and ablation studies confirm the robustness of the proposed step-wise importance estimates.
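The abstract says the policy update is guided both by local comparisons within each branching decision and by the quality of entire trajectories. One way such signals could be mixed is sketched below, assuming mean-baseline advantages at both levels, a hypothetical mixing weight `beta`, and a plain REINFORCE-style surrogate loss. The abstract does not specify PORTool's actual advantage formula or training objective, and all function names here are illustrative.

```python
from statistics import mean
from typing import List


def local_advantages(sibling_importance: List[float]) -> List[float]:
    """Compare alternative steps at one branching point (shared prefix).

    Each step is scored relative to the mean importance of its siblings.
    """
    baseline = mean(sibling_importance)
    return [s - baseline for s in sibling_importance]


def trajectory_advantages(trajectory_rewards: List[float]) -> List[float]:
    """Compare complete trajectories in the tree against their group mean."""
    baseline = mean(trajectory_rewards)
    return [r - baseline for r in trajectory_rewards]


def combined_advantage(local_adv: float, traj_adv: float, beta: float = 0.5) -> float:
    """Hypothetical mix of the step-local and trajectory-level signals."""
    return local_adv + beta * traj_adv


def surrogate_loss(step_logprobs: List[float], step_advantages: List[float]) -> float:
    """REINFORCE-style loss: minimizing it raises log-probs of positive-advantage steps."""
    terms = [adv * lp for adv, lp in zip(step_advantages, step_logprobs)]
    return -mean(terms)
```

Subtracting the sibling mean means a step is reinforced only when it beats the alternatives tried from the same context, which is the kind of step-level credit a shared-prefix tree makes possible; the trajectory-level term keeps the update anchored to overall outcome quality.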