Towards Universal Tabular Embeddings: A Benchmark Across Data Tasks

arXiv cs.LG / 4/24/2026


Key Points

  • The paper introduces TEmBed, a benchmark for comparing tabular embedding methods (i.e., tabular foundation models) across four representation levels: cell, row, column, and whole table.
  • It argues that existing tabular embedding models are difficult to compare because evaluations are often done in task-specific settings, so TEmBed aims to standardize assessment.
  • By evaluating a wide range of tabular representation learning models, the authors find that the best-performing embedding approach depends on both the task type and the representation granularity.
  • The findings provide actionable guidance for choosing tabular embeddings for real-world applications such as table retrieval, semantic search, and table-based prediction, while also supporting future development of more general-purpose tabular representation models.

Abstract

Tabular foundation models aim to learn universal representations of tabular data that transfer across tasks and domains, enabling applications such as table retrieval, semantic search, and table-based prediction. Despite the growing number of such models, it remains unclear which approach works best in practice, as existing methods are often evaluated under task-specific settings that make direct comparison difficult. To address this, we introduce TEmBed, the Tabular Embedding Test Bed, a comprehensive benchmark for systematically evaluating tabular embeddings across four representation levels: cell, row, column, and table. Evaluating a diverse set of tabular representation learning models, we show that the best-performing model depends on both the task and the representation level. Our results offer practical guidance for selecting tabular embeddings in real-world applications and lay the groundwork for developing more general-purpose tabular representation models.
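To make the four representation levels concrete, here is a minimal sketch of what evaluating embeddings at each granularity might look like. The serialization scheme, the toy bag-of-tokens embedding, and the example table are all illustrative assumptions, not the paper's actual method; a real tabular foundation model would replace `toy_embed`.

```python
import math

DIM = 16  # toy embedding size; a real model would use hundreds of dimensions

def toy_embed(text: str, dim: int = DIM) -> list[float]:
    # Hypothetical stand-in for a tabular foundation model: bucket each
    # whitespace token by the sum of its character codes, then L2-normalize.
    vec = [0.0] * dim
    for tok in text.lower().split():
        vec[sum(map(ord, tok)) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def serialize(header, rows, level, i=0, j=0):
    # Linearize one unit of the table at each of the four granularities
    # the benchmark distinguishes: cell, row, column, and whole table.
    if level == "cell":
        return f"{header[j]}: {rows[i][j]}"
    if level == "row":
        return " | ".join(f"{h}: {v}" for h, v in zip(header, rows[i]))
    if level == "column":
        return f"{header[j]}: " + ", ".join(str(r[j]) for r in rows)
    if level == "table":
        return " ; ".join(
            " | ".join(f"{h}: {v}" for h, v in zip(header, r)) for r in rows
        )
    raise ValueError(f"unknown level: {level}")

def cosine(a, b):
    # Both vectors are unit-norm, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

# Tiny example table (made up for illustration).
header = ["city", "population"]
rows = [["Paris", 2100000], ["Lyon", 516000]]

# Row-level retrieval: embed every row, then rank rows against a text query.
row_embs = [toy_embed(serialize(header, rows, "row", i=i)) for i in range(len(rows))]
query = toy_embed("city: Paris")
best = max(range(len(rows)), key=lambda i: cosine(query, row_embs[i]))
print(rows[best][0])  # → Paris
```

Swapping the `level` argument changes which unit gets embedded, which is exactly the axis along which the benchmark compares models: an embedding that retrieves rows well may still rank columns or whole tables poorly.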