Measurement Risk in Supervised Financial NLP: Rubric and Metric Sensitivity on JF-ICR
arXiv cs.AI / 5/1/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper argues that supervised financial NLP benchmarks are not “objective” when the rubric wording, metric choice, or aggregation policy can change the resulting labels and model rankings.
- Using JF-ICR (a 253-item split evaluated across 4 frontier LLMs, 5 rubrics, 3 temperatures, and 5 ordinal metrics), the authors show rubric wording can significantly shift assigned labels, especially around the +1/0 boundary.
- The study finds that some commonly used metrics become uninformative or excessively noisy given JF-ICR's class distribution; under the authors' rules, only exact accuracy, macro-F1, and weighted kappa remain identifiable (see the metric sketch after this list).
- After restricting to the identifiable metric subset, the rank-aggregation methods (Bradley–Terry, Borda, Ranked Pairs) agree more closely, whereas aggregating over all metrics produces disagreement among the closest-ranked candidates (see the Borda sketch after this list).
- The work is framed as a governance/reporting discipline for financial NLP benchmarks rather than a new leaderboard contribution.
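For readers unfamiliar with the three metrics the paper retains, here is a minimal sketch of how they are commonly computed with scikit-learn. The toy labels on the {-1, 0, +1} ordinal scale are hypothetical and not drawn from JF-ICR, and the paper's exact aggregation rules may differ from these library defaults.

```python
# Minimal sketch of the three "identifiable" metrics, using scikit-learn.
# The gold/pred lists below are illustrative, not JF-ICR data.
from sklearn.metrics import accuracy_score, f1_score, cohen_kappa_score

# Hypothetical gold and predicted ordinal labels on the {-1, 0, +1} scale.
gold = [1, 0, 0, -1, 1, 0, -1, 1, 0, 0]
pred = [1, 0, 1, -1, 0, 0, -1, 1, 0, -1]

exact_acc = accuracy_score(gold, pred)                    # fraction of exact label matches
macro_f1 = f1_score(gold, pred, average="macro")          # unweighted mean of per-class F1
qwk = cohen_kappa_score(gold, pred, weights="quadratic")  # kappa penalizing distant ordinal errors

print(f"exact accuracy: {exact_acc:.3f}")
print(f"macro-F1:       {macro_f1:.3f}")
print(f"weighted kappa: {qwk:.3f}")
```

Weighted kappa is the only one of the three that exploits the ordinal structure: confusing +1 with -1 is penalized more heavily than confusing +1 with 0.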
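And here is a minimal sketch of Borda aggregation, one of the three rank-aggregation methods the paper compares. The model names and per-metric scores are invented for illustration; ties are broken arbitrarily by sort order for simplicity.

```python
# Minimal sketch of Borda-count aggregation over per-metric model rankings.
# All names and scores below are hypothetical.
scores = {
    "exact_accuracy": {"model_a": 0.61, "model_b": 0.58, "model_c": 0.63, "model_d": 0.55},
    "macro_f1":       {"model_a": 0.57, "model_b": 0.59, "model_c": 0.60, "model_d": 0.52},
    "weighted_kappa": {"model_a": 0.48, "model_b": 0.44, "model_c": 0.51, "model_d": 0.40},
}

models = list(next(iter(scores.values())))
borda = {m: 0 for m in models}

for metric_scores in scores.values():
    # Rank models best-first on this metric; rank r earns (n - 1 - r) points.
    ranked = sorted(models, key=lambda m: metric_scores[m], reverse=True)
    for rank, model in enumerate(ranked):
        borda[model] += len(models) - 1 - rank

# Final ordering: higher Borda total is better.
for model, points in sorted(borda.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {points}")
```

Because Borda only consumes rankings, dropping noisy metrics changes the input rankings themselves, which is why the paper's metric restriction can move the aggregate ordering.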