FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks
arXiv cs.AI / 4/14/2026
Key Points
- FinTrace is introduced as a new trajectory-level benchmark for evaluating LLM tool calling in long-horizon financial tasks, addressing limitations of existing call-level metrics and narrow scenarios.
- The benchmark includes 800 expert-annotated trajectories across 34 real-world financial task categories and uses a rubric with nine metrics across four axes: action correctness, execution efficiency, process quality, and output quality.
- Evaluations of 13 LLMs show a recurring gap: models can often select the right tools, but struggle with information utilization and producing high-quality final answers.
- To go beyond diagnosis, the paper constructs FinTrace-Training, an 8,196-trajectory preference dataset with tool-augmented contexts and preference pairs for financial tool calling.
- Fine-tuning Qwen-3.5-9B with supervised fine-tuning followed by DPO improves intermediate reasoning and process metrics and reduces common failure modes, but end-to-end final-answer quality remains a bottleneck.
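The DPO step in the last point can be sketched generically. The paper's exact loss formulation and hyperparameters are not given here, so this is a minimal, textbook DPO loss on a single preference pair (a chosen vs. a rejected tool-calling trajectory); all function names and numbers are illustrative.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Each argument is a log-probability summed over the tokens of a
    trajectory, under the policy being trained (logp_*) or the frozen
    reference model (ref_logp_*). beta controls how strongly the policy
    is pushed to widen the chosen-vs-rejected margin relative to the
    reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)): small when the policy already prefers
    # the chosen trajectory more than the reference model does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy numbers: the policy favors the chosen trajectory, so the loss is modest.
loss = dpo_loss(-12.0, -15.0, -13.0, -14.5, beta=0.1)
```

In a real pipeline this loss would be averaged over batches of FinTrace-Training preference pairs and backpropagated through the policy model only, with the reference model kept frozen.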