Measurement Risk in Supervised Financial NLP: Rubric and Metric Sensitivity on JF-ICR

arXiv cs.AI / 5/1/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that supervised financial NLP benchmarks are not “objective” when the rubric wording, metric choice, or aggregation policy can change the resulting labels and model rankings.
  • Using JF-ICR (a pinned 253-item test split evaluated across 4 frontier LLMs, 5 rubrics, 3 temperatures, and 5 ordinal metrics), the authors show that rubric wording materially shifts assigned labels (R2–R3 agreement ranges from 70.0% to 83.4%), with most of the movement near the +1/0 implicit-commitment boundary.
  • The study finds that some commonly used metrics become uninformative or too noisy under JF-ICR's class distribution, leaving exact accuracy, macro-F1, and weighted kappa as the identifiable metrics under the authors' operational rule (see the metric sketch after this list).
  • After restricting to the identifiable metric subset, the ranking methods (Bradley–Terry, Borda, Ranked Pairs) agree, while using all five metrics produces disagreement on the closest pair of models.
  • The work is framed as a governance/reporting discipline for financial NLP benchmarks rather than a new leaderboard contribution.
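
To make the metric distinctions in the key points concrete, here is a minimal sketch (not the authors' code) of how the five ordinal metrics could be computed with scikit-learn. The within-one and worst-class definitions, and the quadratic kappa weighting, are assumptions reconstructed from the abstract's informal descriptions.

```python
# Minimal sketch of the five ordinal metrics discussed above (not the
# authors' implementation). Labels are assumed to be ordinal integers
# such as -1, 0, +1.
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score


def ordinal_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    exact = accuracy_score(y_true, y_pred)
    macro_f1 = f1_score(y_true, y_pred, average="macro")
    # Weighted kappa: quadratic weighting is one common choice for ordinal
    # labels; the paper's exact weighting scheme is not stated here.
    wkappa = cohen_kappa_score(y_true, y_pred, weights="quadratic")
    # Within-one accuracy: near misses (off by one class) still get credit,
    # which is why it saturates when one class dominates.
    within_one = float(np.mean(np.abs(y_true - y_pred) <= 1))
    # Worst-class accuracy: recall of the worst-served class; very noisy
    # when the rarest class has only a couple of examples.
    worst_class = min(
        float(np.mean(y_pred[y_true == c] == c)) for c in np.unique(y_true)
    )
    return {
        "exact": exact,
        "macro_f1": macro_f1,
        "weighted_kappa": wkappa,
        "within_one": within_one,
        "worst_class": worst_class,
    }
```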

Abstract

As LLMs become credible readers of earnings calls, investor-relations Q&A, guidance, and disclosure language, supervised financial NLP benchmarks increasingly function as decision evidence for model selection and deployment. A hidden assumption is that gold labels make such evidence objective. This assumption breaks down when the benchmark ruler itself is sensitive to rubric wording, metric choice, or aggregation policy. We study this measurement risk on Japanese Financial Implicit-Commitment Recognition (JF-ICR; a pinned 253-item test split × 4 frontier LLMs × 5 rubrics × 3 temperatures × 5 ordinal metrics). Three findings follow. First, rubric wording materially changes model-assigned labels: R2–R3 agreement ranges from 70.0% to 83.4%, with the dominant movement near the +1/0 implicit-commitment boundary. This pattern is consistent with a pragmatic-boundary interpretation, but is not a validated linguistic-causality claim because the present rubric variants confound semantics, examples, and verbosity. Second, not every metric remains informative under the JF-ICR class distribution. Within-one accuracy is too easy because near misses receive credit and the majority class dominates; worst-class accuracy is too noisy because the rarest class has only two examples. Exact accuracy, macro-F1, and weighted kappa are therefore the identifiable metrics under our operational rule. Third, ranking claims become more defensible only after this metric-identifiability audit: Bradley–Terry, Borda, and Ranked Pairs agree on the identifiable metric subset, while the full five-metric sweep produces disagreement on the closest pair. The contribution is not a new leaderboard, but a reporting discipline for supervised financial benchmarks whose gold labels exist and whose evaluation ruler still requires governance.
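
For intuition on the ranking-aggregation step, the sketch below applies a plain Borda count, one of the three aggregation methods named in the abstract, to a hypothetical score table over the three identifiable metrics. The model names and numbers are illustrative only, not results from the paper.

```python
# Minimal Borda-count sketch for aggregating per-metric model rankings.
# The score table is hypothetical (higher is better for every metric).
from collections import defaultdict

scores = {
    "exact":          {"model_a": 0.71, "model_b": 0.69, "model_c": 0.64, "model_d": 0.60},
    "macro_f1":       {"model_a": 0.55, "model_b": 0.58, "model_c": 0.49, "model_d": 0.45},
    "weighted_kappa": {"model_a": 0.62, "model_b": 0.61, "model_c": 0.52, "model_d": 0.47},
}

borda = defaultdict(int)
for metric_scores in scores.values():
    # Rank models for this metric, best first.
    ranked = sorted(metric_scores, key=metric_scores.get, reverse=True)
    # The worst model gets 0 Borda points, the best gets len(ranked) - 1.
    for points, model in enumerate(reversed(ranked)):
        borda[model] += points

print(sorted(borda.items(), key=lambda kv: kv[1], reverse=True))
```

Bradley–Terry and Ranked Pairs would consume the same per-metric comparisons but fit a pairwise-preference model and resolve pairwise majorities by margin, respectively; the paper's point is that all three only converge once the uninformative metrics are removed.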