Meta-Tool: Efficient Few-Shot Tool Adaptation for Small Language Models

arXiv cs.CL / 4/23/2026


Key Points

  • The paper presents Meta-Tool, a controlled empirical study of whether small language models can achieve strong tool-use performance without complex adaptation mechanisms, using a Llama-3.2-3B-Instruct backbone.
  • It compares four approaches—few-shot prompting, documentation encoding, hypernetwork-generated LoRA weights, and value-guided beam search—across Gorilla APIBench, Spider 2.0, WebArena, and InterCode.
  • The key result is a negative finding: the hypernetwork that generates LoRA weights (227.8M parameters) shows no measurable improvement over few-shot prompting.
  • Ablations indicate few-shot examples add +21.5% performance and documentation adds +5.0%, while the hypernetwork contributes 0%, and a well-prompted 3B model reaches 79.7% of GPT-5’s average performance at 10× lower latency.
  • Error analysis over 722 failure cases shows task-dependent failure modes: schema-heavy benchmarks fail mainly on semantics, while Gorilla and InterCode are dominated by format errors, reinforcing prompt engineering and example curation over complex adaptation (a minimal prompt-construction sketch follows this list).
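
For concreteness, here is a minimal sketch of the few-shot-plus-documentation prompt assembly the findings favor. The function name, message format, and example fields are illustrative assumptions, not the paper's actual implementation.

```python
def build_tool_prompt(tool_doc: str, examples: list[dict], query: str) -> list[dict]:
    """Assemble a chat prompt: tool documentation, curated few-shot
    examples (the +21.5% component per the ablations), then the query."""
    messages = [{
        "role": "system",
        "content": (
            "You are a tool-use assistant. Answer by calling the API below.\n\n"
            f"API documentation:\n{tool_doc}"  # documentation encoding: +5.0%
        ),
    }]
    # Few-shot examples: the single largest contributor in the ablations.
    for ex in examples:
        messages.append({"role": "user", "content": ex["query"]})
        messages.append({"role": "assistant", "content": ex["call"]})
    messages.append({"role": "user", "content": query})
    return messages

# Usage: send the messages to any chat-completion endpoint serving
# Llama-3.2-3B-Instruct. Note that this is pure prompt assembly; no
# model weights are adapted.
```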

Abstract

Can small language models achieve strong tool-use performance without complex adaptation mechanisms? This paper investigates this question through Meta-Tool, a controlled empirical study comparing hypernetwork-based LoRA adaptation against carefully designed few-shot prompting. Using a Llama-3.2-3B-Instruct backbone, we evaluate four adaptation mechanisms (few-shot prompting, documentation encoding, hypernetwork-generated LoRA weights, and value-guided beam search) across four diverse benchmarks: Gorilla APIBench, Spider 2.0, WebArena, and InterCode. Our central finding is a well-supported negative result: despite generating non-trivial weight matrices, the 227.8M-parameter hypernetwork provides no measurable improvement over few-shot prompting alone. Comprehensive ablation studies reveal that few-shot examples contribute +21.5% to performance and documentation contributes +5.0%, while the hypernetwork adds 0%. A 3B model with well-designed prompts achieves 79.7% of GPT-5's average performance at 10× lower latency. Error analysis across 722 failure cases spanning all shot counts (0–5) shows that at the 5-shot configuration (106 failures), failure modes are task-dependent: schema-heavy tasks (Spider 2.0, WebArena) show near-zero format errors, with the remaining failures semantic, while format errors dominate on Gorilla (100%) and InterCode (70%). These findings redirect practitioners toward prompt engineering and example curation rather than complex adaptation architectures.
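
To make the negative result concrete, below is a hedged PyTorch sketch of the kind of mechanism the paper finds contributes 0%: a hypernetwork that maps a task embedding to LoRA factors for a frozen projection. All dimensions, layer choices, and names here are assumptions for illustration; the actual 227.8M-parameter architecture is not specified in this summary.

```python
import torch
import torch.nn as nn

class LoRAHyperNet(nn.Module):
    """Hypothetical hypernetwork: task embedding -> LoRA factors (A, B)
    for one attention projection of a frozen backbone."""
    def __init__(self, task_dim=768, hidden=1024, d_model=3072, rank=8):
        super().__init__()
        self.rank, self.d_model = rank, d_model
        self.trunk = nn.Sequential(nn.Linear(task_dim, hidden), nn.ReLU())
        # Separate heads emit the flattened low-rank factors.
        self.head_a = nn.Linear(hidden, rank * d_model)  # A: (rank, d_model)
        self.head_b = nn.Linear(hidden, d_model * rank)  # B: (d_model, rank)

    def forward(self, task_emb):
        h = self.trunk(task_emb)
        A = self.head_a(h).view(self.rank, self.d_model)
        B = self.head_b(h).view(self.d_model, self.rank)
        return A, B  # delta_W = B @ A is added to the frozen weight

# The paper's negative result: injecting delta_W generated this way yields
# no measurable gain over the few-shot prompt alone.
task_emb = torch.randn(768)                # placeholder task/tool embedding
A, B = LoRAHyperNet()(task_emb)
delta_W = B @ A                            # shape (3072, 3072)
```

The sketch is only meant to show why the finding is surprising: the hypernetwork produces non-trivial weight updates, yet in the paper's ablations those updates add nothing that curated examples do not already provide.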