Less is More: Adapting Text Embeddings for Low-Resource Languages with Small Scale Noisy Synthetic Data

arXiv cs.CL / 3/25/2026


Key Points

  • The paper argues that low-resource language (LRL) text embedding improvements do not necessarily require massive or pristine human-verified translation datasets.
  • Using Armenian as a case study, it proposes a low-cost adaptation approach that fine-tunes a multilingual encoder (mE5) on only 10,000 noisy synthetic pairs generated from English Reddit title-body translations using open-weights models.
  • Experiments show a “Less is More” effect: fine-tuning on the small noisy set yields 11–12% average benchmark gains and 20%+ relative retrieval improvements, comparable to models trained on about 1 million examples.
  • Increasing synthetic data scale, upgrading translation quality with state-of-the-art LLMs, or diversifying domains did not produce meaningful gains beyond the minimal baseline, suggesting early saturation in semantic alignment.
  • The authors validate results on another script-unique LRL, release the model/data/benchmark for reproducibility, and position the findings as enabling high-performance embeddings for resource-constrained communities.
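The adaptation recipe above amounts to contrastive fine-tuning on translated title-body pairs, where each title's matching body is the positive and the other bodies in the batch serve as negatives. The paper's exact training objective is not stated here, so the in-batch-negatives (InfoNCE-style) loss below is an illustrative assumption, sketched in plain NumPy:

```python
import numpy as np

def in_batch_negatives_loss(title_emb, body_emb, temperature=0.05):
    """Contrastive loss over a batch of (title, body) embedding pairs.

    title_emb, body_emb: arrays of shape (batch, dim); row i of each
    array is assumed to come from the same title-body pair.
    NOTE: loss form and temperature are illustrative assumptions,
    not details confirmed by the paper.
    """
    # L2-normalize so dot products become cosine similarities
    t = title_emb / np.linalg.norm(title_emb, axis=1, keepdims=True)
    b = body_emb / np.linalg.norm(body_emb, axis=1, keepdims=True)
    sim = (t @ b.T) / temperature            # (batch, batch) similarities
    # Row-wise log-softmax with a max-shift for numerical stability
    logits = sim - sim.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive for title i is body i, i.e. the diagonal entry
    return float(-np.mean(np.diag(log_probs)))
```

In practice this objective is what losses like `MultipleNegativesRankingLoss` in the sentence-transformers library implement; with only 10,000 noisy pairs, each batch still supplies dozens of free negatives, which is one plausible reason such a small set suffices.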

Abstract

Low-resource languages (LRLs) often lack high-quality, large-scale datasets for training effective text embedding models, hindering their application in tasks like retrieval-augmented generation (RAG) and semantic search. In this work, we challenge the prevailing assumption that effective semantic alignment requires massive datasets or pristine, human-verified translations. Focusing on Armenian (an LRL with a unique script), we introduce a cost-effective adaptation strategy using small-scale noisy synthetic data generated by translating English Reddit title-body pairs with open-weights models. We establish a comprehensive evaluation benchmark comprising existing datasets, translated data, and a manually curated dataset. Our experiments reveal a surprising "Less is More" phenomenon: fine-tuning a multilingual encoder (mE5) on just 10,000 noisy synthetic pairs yields 11-12% average improvements across the benchmark with a 20%+ relative improvement in retrieval performance, matching the performance of models trained on ~1 million examples. Furthermore, we demonstrate that neither increasing data scale, improving translation quality via state-of-the-art LLMs, nor diversifying data domains yields significant gains over this minimal baseline. We validate the generalizability of these findings on another LRL with a unique script. Our results suggest that semantic alignment for LRLs saturates early and is highly robust to noise, democratizing high-performance embedding creation for resource-constrained communities. We release the model, data, and the benchmark at https://metric-ai-lab.github.io/less-is-more-embeddings/ to facilitate further research.