PL-MTEB: Polish Massive Text Embedding Benchmark

arXiv cs.CL / 4/27/2026

Models & Research

Key Points

  • The paper introduces PL-MTEB, a benchmark focused on evaluating text embedding models for Polish, covering 30 tasks across five NLP categories.
  • PL-MTEB extends the existing MTEB by adding 12 new Polish-language tasks derived from existing datasets and by creating two new datasets to support four clustering tasks.
  • The authors evaluate 30 publicly available text embedding models, including both Polish-specific and multilingual options.
  • Results are analyzed in detail by task type and model size; the datasets, evaluation code, and results are publicly released on GitHub.

Abstract

In this paper, we introduce the Polish Massive Text Embedding Benchmark (PL-MTEB), a comprehensive benchmark for text embeddings in the Polish language. PL-MTEB comprises 30 diverse NLP tasks across five categories: classification, clustering, pair classification, information retrieval, and semantic text similarity. Within the scope of this work, we added 12 new Polish-language tasks to MTEB based on existing datasets and prepared two new datasets used to create four clustering tasks. We evaluated 30 publicly available text embedding models, including Polish and multilingual models. We analyzed the results in detail for specific task types and model sizes. We made the prepared datasets, the source code for evaluation, and the obtained results available to the public at https://github.com/rafalposwiata/pl-mteb.
