TaxPraBen: A Scalable Benchmark for Structured Evaluation of LLMs in Chinese Real-World Tax Practice

arXiv cs.CL / 4/13/2026


Key Points

  • The paper introduces TaxPraBen, a dedicated, scalable benchmark for evaluating Chinese real-world tax practice capabilities rather than isolated NLP subtasks.
  • TaxPraBen includes 10 traditional application tasks and 3 real-world scenarios (tax risk prevention, inspection analysis, and strategy planning) built from 14 datasets totaling 7.3K instances.
  • The evaluation method uses a three-stage structured pipeline (structured parsing, field-alignment extraction, and numerical/textual matching) to support end-to-end tax practice assessment and future extensibility.
  • Experiments across 19 LLMs show large performance gaps, with closed-source large-parameter models performing best, Chinese LLMs (e.g., Qwen2.5) generally outperforming multilingual counterparts, and YaYi2 fine-tuning yielding only limited gains.
  • The authors position TaxPraBen as a reusable resource for advancing and comparing LLMs for highly specialized, legally regulated domains like taxation.
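The structured evaluation pipeline summarized above can be sketched in code. This is an illustrative reconstruction, not the authors' implementation: the field names, the JSON/key-value parsing fallback, the relative tolerance for numeric matching, and the per-field scoring are all assumptions.

```python
# Hypothetical sketch of a "structured parsing -> field-alignment extraction ->
# numerical/textual matching" evaluation step in the spirit of TaxPraBen.
# Field names, tolerance, and scoring below are illustrative assumptions.
import json
import re

def parse_structured(answer: str) -> dict:
    """Parse a model answer into fields: try JSON first, then 'key: value' lines."""
    try:
        return json.loads(answer)
    except json.JSONDecodeError:
        fields = {}
        for line in answer.splitlines():
            m = re.match(r"\s*([^:：]+?)\s*[:：]\s*(.+)", line)
            if m:
                fields[m.group(1)] = m.group(2).strip()
        return fields

def match_field(pred, gold, rel_tol=1e-4) -> bool:
    """Numeric comparison with tolerance when both sides parse as numbers;
    otherwise an exact (whitespace-stripped) text match."""
    try:
        p = float(str(pred).replace(",", ""))
        g = float(str(gold).replace(",", ""))
        return abs(p - g) <= rel_tol * max(1.0, abs(g))
    except ValueError:
        return str(pred).strip() == str(gold).strip()

def score(pred_answer: str, gold_fields: dict) -> float:
    """Align predicted fields to gold fields by name and return the hit rate."""
    pred_fields = parse_structured(pred_answer)
    hits = sum(match_field(pred_fields.get(k), v) for k, v in gold_fields.items())
    return hits / len(gold_fields)

gold = {"taxable_income": "120000", "tax_due": "14590", "rate": "20%"}
pred = '{"taxable_income": "120,000", "tax_due": "14590.0", "rate": "20%"}'
print(score(pred, gold))  # 1.0
```

Separating numeric matching (tolerant of formatting like thousands separators) from textual matching is what lets a single pipeline score both computed tax amounts and categorical answers.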

Abstract

While Large Language Models (LLMs) excel across various general domains, they exhibit notable gaps in the highly specialized, knowledge-intensive, and legally regulated Chinese tax domain. Although tax-related benchmarks are gaining attention, many focus on isolated NLP tasks and neglect real-world practical capabilities. To address this issue, we introduce TaxPraBen, the first dedicated benchmark for Chinese taxation practice. It combines 10 traditional application tasks with 3 pioneering real-world scenarios, tax risk prevention, tax inspection analysis, and tax strategy planning, sourced from 14 datasets totaling 7.3K instances. TaxPraBen features a scalable structured evaluation paradigm built on a pipeline of structured parsing, field-alignment extraction, and numerical and textual matching, enabling end-to-end tax practice assessment while remaining extensible to other domains. We evaluate 19 LLMs according to Bloom's taxonomy. The results reveal significant performance disparities: closed-source large-parameter LLMs excel, Chinese LLMs such as Qwen2.5 generally exceed multilingual LLMs, and the YaYi2 LLM, fine-tuned on some tax data, shows only limited improvement. TaxPraBen serves as a vital resource for advancing evaluation of LLMs in practical applications.