TaxPraBen: A Scalable Benchmark for Structured Evaluation of LLMs in Chinese Real-World Tax Practice
arXiv cs.CL / 4/13/2026
Key Points
- The paper introduces TaxPraBen, a dedicated, scalable benchmark for evaluating Chinese real-world tax practice capabilities rather than isolated NLP subtasks.
- TaxPraBen includes 10 traditional application tasks and 3 real-world scenarios (tax risk prevention, inspection analysis, and strategy planning) built from 14 datasets totaling 7.3K instances.
- The evaluation method uses a structured pipeline—structured parsing, field alignment, extraction, and numerical/text matching—to support end-to-end tax practice assessment and future extensibility.
- Experiments across 19 LLMs reveal large performance gaps: large closed-source models perform best, Chinese LLMs (e.g., Qwen2.5) generally outperform multilingual counterparts, and fine-tuning YaYi2 yields only limited gains.
- The authors position TaxPraBen as a reusable resource for advancing and comparing LLMs for highly specialized, legally regulated domains like taxation.
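The structured pipeline described above (parsing, field alignment, and numerical/text matching) can be sketched in a minimal form. The paper does not publish its implementation; the field names, tolerance, and JSON output format below are illustrative assumptions, not the authors' actual code.

```python
import json

def parse_structured(output: str) -> dict:
    """Structured parsing: assume the model emits JSON; fall back to empty."""
    try:
        return json.loads(output)
    except json.JSONDecodeError:
        return {}

def match_field(pred, gold, tol=1e-2):
    """Numeric fields match within a tolerance; text fields after normalization."""
    try:
        return abs(float(pred) - float(gold)) <= tol
    except (TypeError, ValueError):
        return str(pred).strip().lower() == str(gold).strip().lower()

def score(model_output: str, gold: dict) -> float:
    """Field alignment: compare only the gold fields the task defines."""
    pred = parse_structured(model_output)
    hits = sum(match_field(pred.get(k), v) for k, v in gold.items())
    return hits / len(gold)

# Hypothetical gold record for a tax-calculation instance.
gold = {"tax_due": "1234.50", "rate": "0.13", "category": "VAT"}
out = '{"tax_due": 1234.5, "rate": 0.13, "category": "vat"}'
print(score(out, gold))  # 1.0
```

Separating parsing from matching, as here, is what makes such a pipeline extensible: new task types only need a new gold schema, not a new scorer.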