FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol

arXiv cs.AI / 3/27/2026

Tags: Opinion · Ideas & Deep Analysis · Tools & Practical Usage · Models & Research

Key Points

  • The paper presents FinMCP-Bench, a new benchmark that evaluates LLM agents on real-world financial problems by having them invoke financial tools exposed through the Model Context Protocol (MCP).
  • The benchmark includes 613 samples across 10 scenarios and 33 sub-scenarios, mixing real and synthetic queries to balance diversity with authenticity.
  • FinMCP-Bench tests models against 65 real financial MCPs and supports three complexity modes: single-tool, multi-tool, and multi-turn evaluations.
  • The authors assess a variety of mainstream LLMs and introduce metrics focused on tool-invocation accuracy as well as reasoning performance.
  • Overall, the benchmark is positioned as a standardized, practical testbed to advance research and development of financial LLM agents under MCP-based tool use.
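To make the tool-invocation-accuracy idea above concrete, here is a minimal sketch of how such a metric could be scored. The data schema, field names, and exact-match criterion are assumptions for illustration, not the paper's actual evaluation protocol:

```python
# Hypothetical scoring sketch: a predicted MCP tool call is counted
# correct when the tool name matches and every gold-specified argument
# is reproduced exactly. Field names ("tool", "args", "pred_calls",
# "gold_calls") are illustrative, not from the benchmark.

def tool_call_correct(pred, gold):
    """Exact match on tool name and all gold arguments (extra
    predicted arguments are ignored in this sketch)."""
    if pred["tool"] != gold["tool"]:
        return False
    return all(pred.get("args", {}).get(k) == v
               for k, v in gold.get("args", {}).items())

def invocation_accuracy(samples):
    """Fraction of samples whose predicted call sequence matches the
    gold sequence call by call -- a strict criterion that also covers
    multi-tool samples."""
    correct = sum(
        len(s["pred_calls"]) == len(s["gold_calls"])
        and all(tool_call_correct(p, g)
                for p, g in zip(s["pred_calls"], s["gold_calls"]))
        for s in samples
    )
    return correct / len(samples)

# Two toy samples: one fully correct, one calling the wrong tool.
samples = [
    {"pred_calls": [{"tool": "get_stock_price", "args": {"ticker": "AAPL"}}],
     "gold_calls": [{"tool": "get_stock_price", "args": {"ticker": "AAPL"}}]},
    {"pred_calls": [{"tool": "get_stock_price", "args": {"ticker": "MSFT"}}],
     "gold_calls": [{"tool": "get_fx_rate", "args": {"pair": "EURUSD"}}]},
]
print(invocation_accuracy(samples))  # 0.5
```

A real harness would likely relax exact matching (e.g., order-insensitive comparison or semantically equivalent argument values), but the strict version shows where a separate reasoning-quality metric, as the paper proposes, adds information that call matching alone cannot capture.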

Abstract

This paper introduces FinMCP-Bench, a novel benchmark for evaluating large language models (LLMs) on real-world financial problems solved through invocation of financial Model Context Protocol (MCP) tools. FinMCP-Bench contains 613 samples spanning 10 main scenarios and 33 sub-scenarios, featuring both real and synthetic user queries to ensure diversity and authenticity. It incorporates 65 real financial MCPs and three sample types (single-tool, multi-tool, and multi-turn), allowing evaluation of models across different levels of task complexity. Using this benchmark, we systematically assess a range of mainstream LLMs and propose metrics that explicitly measure tool-invocation accuracy and reasoning capabilities. FinMCP-Bench provides a standardized, practical, and challenging testbed for advancing research on financial LLM agents.