AI Navigate

Outcome-Aware Tool Selection for Semantic Routers: Latency-Constrained Learning Without LLM Inference

arXiv cs.LG / 3/17/2026


Key Points

  • OATS (Outcome-Aware Tool Selection) is a method that refines tool selection in semantic routers for LLM inference gateways, improving accuracy without adding any serving-time latency.
  • The approach operates offline, adding no parameters or serving-time latency, by interpolating tool embeddings toward the centroid of historically successful queries.
  • Empirical results show NDCG@5 improvements from 0.869 to 0.940 on MetaTool and from 0.834 to 0.848 on ToolBench, evaluated on a held-out 30% test split.
  • Learned extensions include a 2,625-parameter MLP re-ranker and a 197K-parameter contrastive adapter; the MLP can hurt or match the baseline when data is sparse, while the contrastive adapter provides comparable gains on MetaTool.
  • The practical takeaway is to start with zero-cost refinement and only add learned components when data density warrants it, with all mechanisms running in single-digit millisecond CPU budgets.
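The core refinement in the second bullet can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the interpolation weight `alpha` and the final re-normalization are assumptions not specified in this summary.

```python
import numpy as np

def refine_tool_embedding(tool_emb, success_query_embs, alpha=0.3):
    """Interpolate a tool embedding toward the centroid of query
    embeddings where the tool historically succeeded.

    alpha is a hypothetical interpolation weight; the summary does
    not state how far embeddings are moved toward the centroid.
    """
    centroid = np.mean(success_query_embs, axis=0)
    refined = (1.0 - alpha) * tool_emb + alpha * centroid
    # Re-normalize so cosine-similarity routing is unaffected by scale.
    return refined / np.linalg.norm(refined)
```

Because this runs entirely offline, the serving path is unchanged: the router still computes cosine similarity between the query embedding and the (now refined) tool embeddings, which is why no parameters or latency are added at request time.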

Abstract

Semantic routers in LLM inference gateways select tools in the critical request path, where every millisecond of added latency compounds across millions of requests. We propose Outcome-Aware Tool Selection (OATS), which interpolates tool embeddings toward the centroid of queries where they historically succeed; this offline process adds no parameters, latency, or GPU cost at serving time. On MetaTool (199 tools, 4,287 queries), this improves NDCG@5 from 0.869 to 0.940; on ToolBench (2,413 APIs), from 0.834 to 0.848. We also evaluate two learned extensions: a 2,625-parameter MLP re-ranker and a 197K-parameter contrastive adapter. The MLP re-ranker hurts or matches the baseline when outcome data is sparse relative to the tool set; the contrastive adapter provides comparable gains on MetaTool (NDCG@5: 0.931). All methods are evaluated on the same held-out 30% test split. The practical takeaway is to start with the zero-cost refinement and add learned components only when data density warrants it. All mechanisms run within single-digit millisecond CPU budgets.
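The results are reported in NDCG@5. For readers unfamiliar with the metric, here is the standard definition (this is general background, not code from the paper): discounted cumulative gain of the produced ranking, normalized by the DCG of the ideal ranking.

```python
import math

def ndcg_at_k(ranked_relevances, k=5):
    """NDCG@k: DCG of the ranking divided by DCG of the ideal ranking.

    ranked_relevances[i] is the relevance of the item placed at rank i
    (e.g. 1 if it is a correct tool for the query, 0 otherwise).
    """
    def dcg(rels):
        # Log-discounted gain: rank 0 divides by log2(2), rank 1 by log2(3), ...
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

    idcg = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / idcg if idcg > 0 else 0.0
```

A router that places the correct tool first scores 1.0; pushing it to rank 2 drops the score to about 0.63, which is why the MetaTool improvement from 0.869 to 0.940 corresponds to correct tools moving meaningfully higher in the top-5 list.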