FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning

arXiv cs.CL / 5/1/2026


Key Points

  • The paper introduces FinChain, a new benchmark focused on verifiable chain-of-thought (CoT) reasoning for financial multi-step analysis, addressing gaps in prior datasets that mainly test final numeric answers.
  • FinChain covers 58 topics across 12 financial domains, using parameterized symbolic templates paired with executable Python code to support fully machine-verifiable reasoning and contamination-free data generation.
  • The authors propose CHAINEVAL, a dynamic alignment metric that jointly evaluates final-answer correctness and step-level reasoning consistency.
  • Experiments on 26 leading LLMs show that even frontier models struggle with symbolic financial reasoning, though domain-adapted and math-enhanced fine-tuned models can improve performance and narrow the gap.
  • The release aims to help researchers develop trustworthy, interpretable, and verifiable financial AI by making intermediate reasoning transparent and testable.
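The template mechanism described above can be sketched in a few lines. The function below is a hypothetical toy example, not FinChain's actual template schema: it samples parameters for a compound-interest question, and the executable loop that produces the answer doubles as a machine-verifiable reasoning trace, so fresh instances can be generated without contaminating evaluation data.

```python
import random

def compound_interest_template(seed=0):
    """Toy parameterized template in the spirit of FinChain's design
    (illustrative only; the benchmark's real templates are not shown here)."""
    rng = random.Random(seed)
    # Sample symbolic parameters so each instance is a fresh,
    # contamination-free problem.
    principal = rng.randrange(1_000, 50_000, 500)
    rate = rng.choice([0.03, 0.04, 0.05, 0.06])
    years = rng.randint(2, 10)

    question = (f"An investor deposits ${principal} at an annual rate of "
                f"{rate:.0%}, compounded yearly. What is the value after "
                f"{years} years?")

    # Executable ground-truth reasoning: every intermediate step is a
    # (description, value) pair that can be checked mechanically.
    steps = []
    value = float(principal)
    for year in range(1, years + 1):
        value = round(value * (1 + rate), 2)
        steps.append((f"Year {year}: balance * (1 + {rate})", value))

    return {"question": question, "steps": steps, "answer": value}

instance = compound_interest_template(seed=42)
```

Because the final answer is computed by the same code that emits the steps, the chain of thought and the answer can never disagree, which is what makes the benchmark fully machine-verifiable.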

Abstract

Multi-step symbolic reasoning is essential for robust financial analysis; yet, current benchmarks largely overlook this capability. Existing datasets such as FinQA and ConvFinQA emphasize final numerical answers while neglecting the intermediate reasoning steps required for transparency and verification. To address this gap, we introduce FinChain, the first benchmark specifically designed for verifiable Chain-of-Thought evaluation in finance. FinChain spans 58 topics across 12 financial domains, each represented by parameterized symbolic templates with executable Python code that enable fully machine-verifiable reasoning and scalable, contamination-free data generation. To assess reasoning capacity, we propose CHAINEVAL, a dynamic alignment measure that jointly evaluates both the final-answer correctness and the step-level reasoning consistency. Our evaluation of 26 leading LLMs reveals that even frontier LLMs exhibit clear limitations in symbolic financial reasoning, while domain-adapted and math-enhanced fine-tuned models can substantially narrow this gap. Overall, FinChain exposes persistent weaknesses in multi-step financial reasoning and provides a foundation for developing trustworthy, interpretable, and verifiable financial AI. This project is available at https://github.com/mbzuai-nlp/finchain.git.
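CHAINEVAL's exact formulation is given in the paper; as a rough intuition, a metric that jointly scores final answers and step consistency could look like the following hypothetical sketch, which greedily aligns gold intermediate values against a model's predicted values and combines that with a final-answer check (the equal weighting is an assumption for illustration).

```python
def step_alignment_score(gold_steps, pred_steps, tol=1e-6):
    """Hypothetical simplification of a CHAINEVAL-style metric:
    greedily match gold intermediate values to predicted ones, then
    combine step consistency with final-answer correctness."""
    matched = 0
    used = set()
    for g in gold_steps:
        for i, p in enumerate(pred_steps):
            # Relative tolerance guards against rounding noise.
            if i not in used and abs(g - p) <= tol * max(1.0, abs(g)):
                used.add(i)
                matched += 1
                break
    step_score = matched / len(gold_steps) if gold_steps else 0.0
    final_ok = bool(
        gold_steps and pred_steps
        and abs(gold_steps[-1] - pred_steps[-1])
            <= tol * max(1.0, abs(gold_steps[-1]))
    )
    # Joint score: equal weight on step consistency and the final answer.
    return 0.5 * step_score + 0.5 * float(final_ok)
```

For example, a model that reproduces every intermediate value scores 1.0, while one that reaches a wrong final answer via one correct step out of two scores 0.25; the point is that a right answer with wrong working, or vice versa, is penalized rather than fully credited.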