F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World
arXiv cs.CL · March 20, 2026
Key Points
- F2LLM-v2 is a new family of multilingual embedding models spanning 80M to 14B parameters, trained on a curated dataset of 60 million samples and supporting over 200 languages.
- Training follows a two-stage LLM-based embedding pipeline that combines Matryoshka representation learning, model pruning, and knowledge distillation to boost efficiency while preserving quality; F2LLM-v2-14B ranks first on 11 MTEB benchmarks (see the sketches after this list).
- The release emphasizes open-source access, making all models, data, code, and intermediate checkpoints available to the research community.
- The smaller models set new state-of-the-art results for resource-constrained applications and advance support for underserved mid- and low-resource languages.
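For readers who want a feel for what Matryoshka-trained embeddings enable, here is a minimal inference sketch. It assumes the F2LLM-v2 checkpoints load through sentence-transformers; the model ID and the 256-dimension cutoff are placeholders for illustration, not confirmed release details.

```python
# A minimal sketch of Matryoshka-style inference. The model ID below is a
# hypothetical placeholder, not a confirmed F2LLM-v2 release name, and the
# sketch assumes the checkpoints load through sentence-transformers.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("org/F2LLM-v2-80M")  # hypothetical model ID

sentences = [
    "Embeddings for a multilingual world.",
    "Des plongements pour un monde multilingue.",
]
full = model.encode(sentences, normalize_embeddings=True)  # shape (2, d)

# Matryoshka-trained embeddings stay usable when truncated to a prefix of
# their dimensions; re-normalize so cosine similarity remains well-defined.
dim = 256  # illustrative cutoff
small = full[:, :dim]
small = small / np.linalg.norm(small, axis=1, keepdims=True)

print(float(small[0] @ small[1]))  # cosine similarity at the reduced dimension
```

Truncating to a dimension prefix and re-normalizing is the standard way Matryoshka-trained embeddings trade a little accuracy for smaller vector storage and faster search.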
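The announcement does not spell out the distillation objective, but a common recipe for embedding distillation is to align the student's vectors with those of a frozen teacher. The PyTorch sketch below shows one such cosine-alignment loss; the loss form, dimensions, and projection layer are assumptions for illustration, not the confirmed F2LLM-v2 recipe.

```python
# A minimal sketch of embedding knowledge distillation: the student is trained
# to align with a frozen teacher's unit-normalized embeddings. All specifics
# here (loss form, dimensions, projection) are illustrative assumptions.
import torch
import torch.nn.functional as F

def distill_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    """Mean cosine-alignment loss between student and teacher embeddings."""
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb.detach(), dim=-1)  # teacher gradients blocked
    return (1.0 - (s * t).sum(dim=-1)).mean()

# Toy usage: a batch of 4 texts; a linear head bridges the dimension gap
# between a small student (512-d) and a larger teacher (1024-d).
proj = torch.nn.Linear(512, 1024)
student = proj(torch.randn(4, 512))
teacher = torch.randn(4, 1024)
print(distill_loss(student, teacher))
```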