AI Navigate

Embedding World Knowledge into Tabular Models: Towards Best Practices for Embedding Pipeline Design

arXiv cs.LG / March 19, 2026

💬 Opinion · Models & Research

Key Points

  • The paper systematically benchmarks 256 embedding-based pipeline configurations for tabular prediction, covering 8 preprocessing strategies, 16 embedding models, and 2 downstream models.
  • It finds that the benefit of incorporating LLM-derived world knowledge depends strongly on the specific pipeline design, with concatenating embeddings generally outperforming replacing original columns.
  • Larger embedding models tend to perform better, while public leaderboard rankings and model popularity are poor indicators of actual performance.
  • Gradient boosting decision trees emerge as strong downstream models in these embedding pipelines.
  • The study offers practical guidance for researchers and practitioners on designing more effective embedding pipelines for tabular prediction tasks.
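The two preprocessing strategies the benchmark contrasts can be illustrated with a minimal sketch: a text column is encoded by an embedding model, and its vectors are either appended alongside the original features ("concatenate") or substituted for the raw column ("replace") before fitting a gradient boosting model. The hash-based `embed` function below is a deterministic stand-in for a real LLM embedder (e.g. a sentence encoder), and the toy data and names are purely illustrative, not taken from the paper.

```python
import zlib
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def embed(texts, dim=16):
    """Stand-in for an LLM embedding model: a deterministic
    pseudo-random vector per string. A real pipeline would call an
    actual embedding model here instead."""
    vecs = []
    for t in texts:
        rng = np.random.default_rng(zlib.crc32(t.encode()))
        vecs.append(rng.standard_normal(dim))
    return np.vstack(vecs)

# Toy table: four numeric columns plus one free-text column.
rng = np.random.default_rng(0)
X_num = rng.standard_normal((200, 4))
text_col = np.array(["red wine", "green tea", "blue cheese", "red wine"] * 50)
y = np.char.startswith(text_col, "red").astype(int)

# Simple integer encoding of the raw text column (the "original" feature).
text_codes = np.unique(text_col, return_inverse=True)[1].reshape(-1, 1)
E = embed(text_col)

# Strategy 1 ("concatenate"): keep the encoded original column, append embeddings.
X_concat = np.hstack([X_num, text_codes, E])
# Strategy 2 ("replace"): drop the original column, use embeddings in its place.
X_replace = np.hstack([X_num, E])

# GBDT as the downstream model, per the paper's finding.
model = GradientBoostingClassifier(random_state=0).fit(X_concat, y)
acc = model.score(X_concat, y)
```

In this sketch the concatenated design preserves all information from the raw column, which matches the paper's finding that concatenation generally beats replacement; with a real embedder, swapping `embed` for an actual model call is the only change needed.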

Abstract

Embeddings are a powerful way to enrich data-driven machine learning models with the world knowledge of large language models (LLMs). Yet, there is limited evidence on how to design effective LLM-based embedding pipelines for tabular prediction. In this work, we systematically benchmark 256 pipeline configurations, covering 8 preprocessing strategies, 16 embedding models, and 2 downstream models. Our results show that whether incorporating the prior knowledge of LLMs improves predictive performance depends strongly on the specific pipeline design. In general, concatenating embeddings tends to outperform replacing the original columns with embeddings. Larger embedding models tend to yield better results, while public leaderboard rankings and model popularity are poor performance indicators. Finally, gradient boosting decision trees tend to be strong downstream models. Our findings provide researchers and practitioners with guidance for building more effective embedding pipelines for tabular prediction tasks.