Meta-Tool: Efficient Few-Shot Tool Adaptation for Small Language Models

arXiv cs.CL / 4/23/2026


Key Points

  • The paper presents Meta-Tool, a controlled empirical study of whether small language models can achieve strong tool-use performance without complex adaptation mechanisms, using a Llama-3.2-3B-Instruct backbone.
  • It compares four approaches—few-shot prompting, documentation encoding, hypernetwork-generated LoRA weights, and value-guided beam search—across Gorilla APIBench, Spider 2.0, WebArena, and InterCode.
  • The key result is a negative finding: the hypernetwork that generates LoRA weights (227.8M parameters) shows no measurable improvement over few-shot prompting.
  • Ablations indicate few-shot examples add +21.5% performance and documentation adds +5.0%, while the hypernetwork contributes 0%, and a well-prompted 3B model reaches 79.7% of GPT-5’s average performance at 10× lower latency.
  • Error analysis over 722 failure cases shows task-dependent failure modes: schema-heavy benchmarks fail mainly on semantics, while Gorilla and InterCode are dominated by format errors, reinforcing prompt engineering and example curation over complex adaptation (a minimal prompt-construction sketch follows this list).
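
For concreteness, here is a minimal sketch of the few-shot-plus-documentation prompt assembly the findings favor. The function name, message format, and example fields are illustrative assumptions, not the paper's actual implementation.

```python
def build_tool_prompt(tool_doc: str, examples: list[dict], query: str) -> list[dict]:
    """Assemble a chat prompt: tool documentation, curated few-shot
    examples (the +21.5% component per the ablations), then the query."""
    messages = [{
        "role": "system",
        "content": (
            "You are a tool-use assistant. Answer by calling the API below.\n\n"
            f"API documentation:\n{tool_doc}"  # documentation encoding: +5.0%
        ),
    }]
    # Few-shot examples: the single largest contributor in the ablations.
    for ex in examples:
        messages.append({"role": "user", "content": ex["query"]})
        messages.append({"role": "assistant", "content": ex["call"]})
    messages.append({"role": "user", "content": query})
    return messages

# Usage: send the messages to any chat-completion endpoint serving
# Llama-3.2-3B-Instruct. Note that this is pure prompt assembly; no
# model weights are adapted.
```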

Abstract

Can small language models achieve strong tool-use performance without complex adaptation mechanisms? This paper investigates this question through Meta-Tool, a controlled empirical study comparing hypernetwork-based LoRA adaptation against carefully designed few-shot prompting. Using a Llama-3.2-3B-Instruct backbone, we evaluate four adaptation mechanisms (few-shot prompting, documentation encoding, hypernetwork-generated LoRA weights, and value-guided beam search) across four diverse benchmarks: Gorilla APIBench, Spider 2.0, WebArena, and InterCode. Our central finding is a well-supported negative result: despite generating non-trivial weight matrices, the 227.8M-parameter hypernetwork provides no measurable improvement over few-shot prompting alone. Comprehensive ablation studies reveal that few-shot examples contribute +21.5% to performance and documentation contributes +5.0%, while the hypernetwork adds 0%. A 3B model with well-designed prompts achieves 79.7% of GPT-5's average performance at 10× lower latency. Error analysis across 722 failure cases spanning all shot counts (0–5) shows that at the 5-shot configuration (106 failures), failure modes are task-dependent: schema-heavy tasks (Spider 2.0, WebArena) show near-zero format errors, with the remaining failures semantic, while format errors dominate on Gorilla (100%) and InterCode (70%). These findings redirect practitioners toward prompt engineering and example curation rather than complex adaptation architectures.
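
To make the negative result concrete, below is a hedged PyTorch sketch of the kind of mechanism the paper finds contributes 0%: a hypernetwork that maps a task embedding to LoRA factors for a frozen projection. All dimensions, layer choices, and names here are assumptions for illustration; the actual 227.8M-parameter architecture is not specified in this summary.

```python
import torch
import torch.nn as nn

class LoRAHyperNet(nn.Module):
    """Hypothetical hypernetwork: task embedding -> LoRA factors (A, B)
    for one attention projection of a frozen backbone."""
    def __init__(self, task_dim=768, hidden=1024, d_model=3072, rank=8):
        super().__init__()
        self.rank, self.d_model = rank, d_model
        self.trunk = nn.Sequential(nn.Linear(task_dim, hidden), nn.ReLU())
        # Separate heads emit the flattened low-rank factors.
        self.head_a = nn.Linear(hidden, rank * d_model)  # A: (rank, d_model)
        self.head_b = nn.Linear(hidden, d_model * rank)  # B: (d_model, rank)

    def forward(self, task_emb):
        h = self.trunk(task_emb)
        A = self.head_a(h).view(self.rank, self.d_model)
        B = self.head_b(h).view(self.d_model, self.rank)
        return A, B  # delta_W = B @ A is added to the frozen weight

# The paper's negative result: injecting delta_W generated this way yields
# no measurable gain over the few-shot prompt alone.
task_emb = torch.randn(768)                # placeholder task/tool embedding
A, B = LoRAHyperNet()(task_emb)
delta_W = B @ A                            # shape (3072, 3072)
```

The sketch is only meant to show why the finding is surprising: the hypernetwork produces non-trivial weight updates, yet in the paper's ablations those updates add nothing that curated examples do not already provide.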