FinTruthQA: A Benchmark for AI-Driven Financial Disclosure Quality Assessment in Investor–Firm Interactions

arXiv cs.CL / 3/30/2026


Key Points

  • The paper introduces FinTruthQA, described as the first benchmark for AI-driven assessment of financial disclosure quality in investor–firm interaction Q&A on Chinese stock exchange investor platforms.
  • FinTruthQA contains 6,000 real-world Q&A entries manually annotated using four criteria: question identification, question relevance, answer readability, and answer relevance.
  • Benchmarking across statistical ML, pre-trained/fine-tuned language models, and LLM approaches shows high performance for question identification and question relevance (F1 > 95%) but notably lower performance for answer readability (~88% Micro F1) and especially answer relevance (~80% Micro F1).
  • Domain- and task-adapted pre-trained models outperform general-purpose models and LLM prompting methods in the hardest evaluation settings, suggesting adaptation is important for fine-grained disclosure quality scoring.
  • The authors position FinTruthQA as a practical foundation for AI-based disclosure monitoring to support regulatory oversight, investor protection, and corporate disclosure governance.
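The Micro F1 figures cited above pool true positives, false positives, and false negatives across all label classes before computing precision and recall; for single-label classification this reduces to plain accuracy. A minimal sketch of that computation (the labels and predictions below are invented for illustration, not drawn from FinTruthQA):

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: aggregate TP/FP/FN over all classes, then average."""
    tp = fp = fn = 0
    for c in set(y_true) | set(y_pred):
        tp += sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp += sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn += sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical answer-relevance grades on a 3-point scale (0 = irrelevant,
# 2 = fully relevant) -- 8 of 10 predictions match the gold labels.
gold = [2, 1, 0, 2, 2, 1, 0, 2, 1, 2]
pred = [2, 1, 1, 2, 0, 1, 0, 2, 1, 2]
print(round(micro_f1(gold, pred), 2))  # 0.8
```

Because each Q&A entry receives exactly one label per criterion, micro-averaging weights frequent classes more heavily than macro-averaging would, which matters when grade distributions are skewed.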

Abstract

Accurate and transparent financial information disclosure is essential for market efficiency, investor decision-making, and corporate governance. Chinese stock exchanges' investor interactive platforms provide a widely used channel through which listed firms respond to investor concerns, yet these responses are often limited or non-substantive, making disclosure quality difficult to assess at scale. To address this challenge, we introduce FinTruthQA, to our knowledge the first benchmark for AI-driven assessment of financial disclosure quality in investor-firm interactions. FinTruthQA comprises 6,000 real-world financial Q&A entries, each manually annotated based on four key evaluation criteria: question identification, question relevance, answer readability, and answer relevance. We benchmark statistical machine learning models, pre-trained language models and their fine-tuned variants, as well as large language models (LLMs), on FinTruthQA. Experiments show that existing models achieve strong performance on question identification and question relevance (F1 > 95%), but remain substantially weaker on answer readability (Micro F1 approximately 88%) and especially answer relevance (Micro F1 approximately 80%), highlighting the nontrivial difficulty of fine-grained disclosure quality assessment. Domain- and task-adapted pre-trained language models consistently outperform general-purpose models and LLM-based prompting on the most challenging settings. These findings position FinTruthQA as a practical foundation for AI-driven disclosure monitoring in capital markets, with value for regulatory oversight, investor protection, and disclosure governance in real-world financial settings.