Benchmarking Tabular Foundation Models for Conditional Density Estimation in Regression

arXiv cs.LG · March 30, 2026


Key Points

  • The paper benchmarks tabular foundation model variants (e.g., TabPFN/TabICL-style models) for conditional density estimation (CDE) in regression, focusing on recovering full predictive distributions rather than point estimates.
  • Evaluations on 39 real-world tabular datasets across training sizes (50 to 20,000) and multiple baselines show foundation models generally achieve the best density accuracy, log-likelihood, and CRPS, indicating strong off-the-shelf CDE performance.
  • Calibration is competitive at small sample sizes, but for certain datasets/metrics it can lag specialized neural CDE baselines as data size grows, implying that post-hoc recalibration may improve reliability.
  • In an SDSS DR18 photometric redshift case study, a TabPFN variant trained on 50,000 galaxies outperforms baselines trained on the full 500,000-galaxy dataset, suggesting sample-efficiency benefits.
  • The results position tabular foundation models as effective general-purpose conditional density estimators, filling a gap where CDE performance had not been systematically assessed compared with point prediction.
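The key points above lean on CRPS (continuous ranked probability score) as a headline metric for distributional accuracy. As an illustration only (not the paper's evaluation code), the snippet below computes the standard closed-form CRPS for a Gaussian predictive distribution; the benchmark itself evaluates arbitrary predictive densities, for which CRPS is typically approximated from samples or quantiles.

```python
import numpy as np
from scipy.stats import norm

def crps_gaussian(y, mu, sigma):
    """Closed-form CRPS for a Gaussian predictive distribution N(mu, sigma^2).

    Lower is better; as sigma -> 0 the CRPS reduces to the absolute error
    |y - mu|, so it generalizes point-prediction error to full distributions.
    """
    z = (y - mu) / sigma
    return sigma * (z * (2.0 * norm.cdf(z) - 1.0)
                    + 2.0 * norm.pdf(z)
                    - 1.0 / np.sqrt(np.pi))

# A perfectly centered unit-spread forecast still pays a spread penalty:
print(crps_gaussian(0.0, 0.0, 1.0))  # ~0.2337
```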

Abstract

Conditional density estimation (CDE), recovering the full conditional distribution of a response given tabular covariates, is essential in settings with heteroscedasticity, multimodality, or asymmetric uncertainty. Recent tabular foundation models, such as TabPFN and TabICL, naturally produce predictive distributions, but their effectiveness as general-purpose CDE methods has not been systematically evaluated, unlike their performance for point prediction, which is well studied. We benchmark three tabular foundation model variants against a diverse set of parametric, tree-based, and neural CDE baselines on 39 real-world datasets, across training sizes from 50 to 20,000, using six metrics covering density accuracy, calibration, and computation time. Across all sample sizes, foundation models achieve the best CDE loss, log-likelihood, and CRPS on the large majority of datasets tested. Calibration is competitive at small sample sizes but, for some metrics and datasets, lags behind task-specific neural baselines at larger sample sizes, suggesting that post-hoc recalibration may be a valuable complement. In a photometric redshift case study using SDSS DR18, TabPFN exposed to 50,000 training galaxies outperforms all baselines trained on the full 500,000-galaxy dataset. Taken together, these results establish tabular foundation models as strong off-the-shelf conditional density estimators.
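The calibration checks the abstract refers to are commonly based on the probability integral transform (PIT): if a model's predictive CDFs are correct, evaluating each CDF at the observed response yields Uniform(0, 1) values. The sketch below (synthetic data and model choices are illustrative assumptions, not from the paper) shows how a miscalibrated model that ignores heteroscedasticity is exposed by a uniformity test on its PIT values:

```python
import numpy as np
from scipy.stats import norm, kstest

rng = np.random.default_rng(0)

# Synthetic heteroscedastic regression: y | x ~ N(x, (0.5 + x)^2)
x = rng.uniform(0.0, 1.0, size=5000)
y = rng.normal(loc=x, scale=0.5 + x)

# PIT under the true predictive CDF: should be Uniform(0, 1)
pit_good = norm.cdf(y, loc=x, scale=0.5 + x)

# PIT under a model with the right mean but a constant (too small) scale
pit_bad = norm.cdf(y, loc=x, scale=0.5)

# Kolmogorov-Smirnov test against uniformity on [0, 1]
print(kstest(pit_good, "uniform").pvalue)  # large -> well calibrated
print(kstest(pit_bad, "uniform").pvalue)   # tiny  -> miscalibrated
```

Post-hoc recalibration, as the abstract suggests, amounts to learning a monotone map that flattens a non-uniform PIT histogram like `pit_bad`'s.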