OmniTabBench: Mapping the Empirical Frontiers of GBDTs, Neural Networks, and Foundation Models for Tabular Data at Scale

arXiv cs.LG / 4/9/2026


Key Points

  • The paper introduces OmniTabBench, a large-scale benchmark for tabular data with 3,030 datasets across diverse tasks, collected from varied sources and categorized by industry using large language models.
  • It reports an extensive evaluation of state-of-the-art models spanning tree-based ensembles, neural networks, and foundation-model approaches, finding no single dominant paradigm that consistently wins.
  • Using a decoupled metafeature analysis (e.g., dataset size, feature types, and feature/target distribution characteristics like skewness and kurtosis), the study identifies conditions under which different model families perform better.
  • The authors argue that OmniTabBench addresses prior benchmark limitations—especially small benchmark sizes (<100 datasets) and potential selection bias—by providing more robust, scale-appropriate empirical evidence.
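The decoupled metafeature analysis described above examines one dataset property at a time rather than a compound score. A minimal sketch of computing such per-dataset metafeatures is below; the function name and the exact set of metafeatures are illustrative assumptions, not the paper's actual implementation, and pandas' built-in sample skewness/kurtosis estimators are used for simplicity.

```python
import numpy as np
import pandas as pd


def dataset_metafeatures(df: pd.DataFrame, target: str) -> dict:
    """Compute simple per-dataset metafeatures of the kind analyzed
    individually in a decoupled metafeature study (illustrative set:
    size, feature types, feature/target skewness and kurtosis)."""
    features = df.drop(columns=[target])
    numeric = features.select_dtypes(include=np.number)
    y = df[target]

    meta = {
        "n_rows": len(df),
        "n_features": features.shape[1],
        # Fraction of non-numeric (e.g., categorical) feature columns.
        "frac_categorical": 1.0 - numeric.shape[1] / features.shape[1],
        # Mean absolute skewness and mean excess kurtosis across
        # numeric feature columns (pandas sample estimators).
        "feature_skewness": float(numeric.skew().abs().mean()),
        "feature_kurtosis": float(numeric.kurt().mean()),
    }
    # Target distribution shape only makes sense for numeric targets
    # (regression); classification targets are skipped here.
    if pd.api.types.is_numeric_dtype(y):
        meta["target_skewness"] = float(y.skew())
        meta["target_kurtosis"] = float(y.kurt())
    return meta


# Toy usage: one numeric feature, one categorical feature, numeric target.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": ["x", "y", "x", "y"],
    "y": [0.1, 0.2, 0.3, 0.4],
})
meta = dataset_metafeatures(toy, target="y")
```

Each metafeature can then be binned and correlated with per-model rankings to ask, for example, whether tree ensembles pull ahead as `frac_categorical` or `feature_skewness` grows.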

Abstract

While traditional tree-based ensemble methods have long dominated tabular tasks, deep neural networks and emerging foundation models have challenged this primacy, yet no consensus exists on a universally superior paradigm. Existing benchmarks typically contain fewer than 100 datasets, raising concerns about evaluation sufficiency and potential selection biases. To address these limitations, we introduce OmniTabBench, the largest tabular benchmark to date, comprising 3,030 datasets spanning diverse tasks, comprehensively collected from varied sources and categorized by industry using large language models. We conduct an unprecedented large-scale empirical evaluation of state-of-the-art models from all model families on OmniTabBench, confirming the absence of a dominant winner. Furthermore, through a decoupled metafeature analysis, which examines individual properties such as dataset size, feature types, and feature and target skewness/kurtosis, we elucidate conditions favoring specific model categories, providing clearer, more actionable guidance than prior compound-metric studies.