ScoringBench: A Benchmark for Evaluating Tabular Foundation Models with Proper Scoring Rules
arXiv cs.AI / 4/1/2026
Key Points
- ScoringBench is introduced as an open benchmark for evaluating tabular foundation models using proper scoring rules that better capture probabilistic forecast quality than point-estimate metrics alone.
- The benchmark computes multiple distribution-aware metrics (e.g., CRPS, CRLS, Interval Score, Energy Score, weighted CRPS, and Brier Score) alongside standard regression measures like RMSE and R².
- Experiments with fine-tuned versions of realTabPFN v2.5 and TabICL show that model rankings change depending on the chosen scoring rule, indicating that no single pretraining objective is universally best.
- The authors argue that proper metric selection is crucial for high-stakes domains where tail behavior and asymmetric risk are important, such as finance and clinical research.
- ScoringBench provides a public leaderboard and live preview, with updates managed via git pull requests to support transparency, traceability, and reproducibility.
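To make the distinction between point-estimate metrics and proper scoring rules concrete, here is a minimal sketch (not code from the paper) of two of the metrics named above: the closed-form CRPS of a Gaussian forecast and the interval score for a central prediction interval. The function names and the toy values are illustrative assumptions, not part of ScoringBench itself.

```python
# Hypothetical sketch: two proper scoring rules mentioned in the benchmark.
# Lower scores are better for both metrics.
import numpy as np
from scipy.stats import norm

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS of a N(mu, sigma^2) forecast at observation y."""
    z = (y - mu) / sigma
    return sigma * (z * (2.0 * norm.cdf(z) - 1.0)
                    + 2.0 * norm.pdf(z)
                    - 1.0 / np.sqrt(np.pi))

def interval_score(lower, upper, y, alpha=0.1):
    """Interval score for a central (1 - alpha) prediction interval [lower, upper]:
    the interval width plus a penalty, scaled by 2/alpha, for observations
    falling outside the interval."""
    width = upper - lower
    below = (2.0 / alpha) * np.maximum(lower - y, 0.0)
    above = (2.0 / alpha) * np.maximum(y - upper, 0.0)
    return width + below + above

# Two forecasts centered on the truth: the sharper one earns a lower CRPS,
# which RMSE alone (identical point predictions) cannot distinguish.
y = 1.0
sharp = crps_gaussian(1.0, 0.5, y)
diffuse = crps_gaussian(1.0, 2.0, y)
print(sharp, diffuse)
```

The example illustrates the paper's point: both forecasts have the same mean (so identical RMSE), yet CRPS rewards the sharper, better-calibrated distribution, which is exactly the tail- and uncertainty-sensitive behavior the authors argue matters in high-stakes domains.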




