IndiaFinBench: An Evaluation Benchmark for Large Language Model Performance on Indian Financial Regulatory Text
arXiv cs.CL · April 22, 2026
Key Points
- The paper introduces IndiaFinBench, a new public evaluation benchmark designed to measure large language model (LLM) performance on Indian financial regulatory text, a gap left by prior Western-only benchmarks.
- The benchmark includes 406 expert-annotated question-answer pairs drawn from 192 SEBI and RBI documents, covering four task types: regulatory interpretation, numerical reasoning, contradiction detection, and temporal reasoning.
- Annotation quality is supported by both model-based validation (kappa=0.918 for contradiction detection) and a human inter-annotator agreement study (kappa=0.611; 76.7% overall agreement).
- In zero-shot evaluations of twelve models, accuracy ranges from 70.4% (Gemma 4 E4B) to 89.7% (Gemini 2.5 Flash), with all models outperforming a non-specialist human baseline of 60.0%.
- Numerical reasoning shows the strongest differentiation across models, and bootstrap significance testing identifies three statistically distinct performance tiers; the dataset, evaluation code, and outputs are released on GitHub.
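The kappa figures above are standard chance-corrected agreement statistics. As a reference point, a minimal self-contained sketch of Cohen's kappa (with illustrative labels, not the paper's data) could look like this:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators.

    labels_a / labels_b: parallel lists of labels for the same items.
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each annotator's marginals.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((counts_a[l] / n) * (counts_b[l] / n)
              for l in set(labels_a) | set(labels_b))
    return (p_o - p_e) / (1 - p_e)
```

On the paper's scale, kappa = 0.918 (model-based contradiction validation) indicates near-perfect agreement, while kappa = 0.611 (human annotators) is conventionally read as substantial agreement.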
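The summary does not spell out the paper's bootstrap procedure for separating models into tiers. A common approach is a paired bootstrap over per-question correctness: resample questions with replacement and check whether the confidence interval on the accuracy difference excludes zero. A minimal sketch, with hypothetical correctness vectors:

```python
import random

def paired_bootstrap_ci(correct_a, correct_b, n_boot=2000, alpha=0.05, seed=0):
    """Bootstrap confidence interval on the accuracy gap between two models.

    correct_a / correct_b: per-question 0/1 correctness, aligned on the
    same questions so resampling is paired.
    """
    assert len(correct_a) == len(correct_b) and correct_a
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_boot):
        # Resample question indices with replacement (paired for both models).
        idx = [rng.randrange(n) for _ in range(n)]
        acc_a = sum(correct_a[i] for i in idx) / n
        acc_b = sum(correct_b[i] for i in idx) / n
        diffs.append(acc_a - acc_b)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return sum(diffs) / n_boot, lo, hi
```

Two models land in the same tier when the interval `[lo, hi]` straddles zero, and in distinct tiers when it does not; chaining such pairwise tests over the twelve models would yield the three tiers the paper reports.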



