Embedding World Knowledge into Tabular Models: Towards Best Practices for Embedding Pipeline Design
arXiv cs.LG / 3/19/2026
💬 Opinion · Models & Research
Key Points
- The paper systematically benchmarks 256 embedding-based pipeline configurations for tabular prediction, covering 8 preprocessing strategies, 16 embedding models, and 2 downstream models.
- It finds that the benefit of incorporating LLM-derived world knowledge depends strongly on the specific pipeline design; concatenating embeddings with the original features generally outperforms replacing the original columns with embeddings.
- Larger embedding models tend to yield better performance, while public leaderboard rankings and model popularity are poor indicators of actual performance.
- Gradient boosting decision trees emerge as strong downstream models in these embedding pipelines.
- The study offers practical guidance for researchers and practitioners on designing more effective embedding pipelines for tabular prediction tasks.
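
To make the concatenate-versus-replace distinction concrete, here is a minimal sketch of one plausible reading of the two strategies. It is not the paper's pipeline: `toy_embed` is a hypothetical hash-based stand-in for a real embedding model, `build_features` and the example rows are invented for illustration, and the ordinal encoding of the original column is an assumption about how the "original" feature might be represented.

```python
import hashlib


def toy_embed(text, dim=4):
    """Hypothetical stand-in for an embedding model.

    Returns a deterministic pseudo-embedding derived from a hash, so the
    sketch runs without downloading a model. It carries no world knowledge.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]


def ordinal_encode(values):
    """Map each distinct category to an integer in order of first appearance."""
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes))
    return codes


def build_features(rows, cat_col, mode, dim=4):
    """Build one feature vector per row.

    mode="concat":  numeric columns + ordinal code of cat_col + embedding
    mode="replace": numeric columns + embedding (original cat_col dropped)
    """
    if mode not in ("concat", "replace"):
        raise ValueError(f"unknown mode: {mode!r}")
    codes = ordinal_encode(r[cat_col] for r in rows)
    features = []
    for r in rows:
        numeric = [float(v) for k, v in r.items() if k != cat_col]
        emb = toy_embed(str(r[cat_col]), dim)
        if mode == "concat":
            features.append(numeric + [float(codes[r[cat_col]])] + emb)
        else:
            features.append(numeric + emb)
    return features


# Invented example rows, not data from the paper.
rows = [
    {"city": "Paris", "age": 31, "income": 52000},
    {"city": "Lagos", "age": 45, "income": 38000},
    {"city": "Paris", "age": 27, "income": 61000},
]

concat_X = build_features(rows, "city", "concat")
replace_X = build_features(rows, "city", "replace")
```

Under these assumptions, `concat_X` rows are one feature wider than `replace_X` rows because the encoded original column is kept alongside the embedding. Either matrix could then be passed to a downstream model such as a gradient boosting decision tree.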



