Is Textual Similarity Invariant under Machine Translation? Evidence Based on the Political Manifesto Corpus

arXiv cs.CL / 5/4/2026

📰 News · Models & Research

Key Points

  • The study tests whether cosine similarity between paragraph embeddings remains unchanged after machine translation, using over 2,800 political party platforms in 28 languages, translated into English via the EU eTranslation service.
  • Instead of directly measuring semantic shift from translation, the researchers assess how stable pairwise similarity relationships are across different embedding models and calibrate an invariance threshold using disagreement on the original-language text.
  • They formulate a per-language, non-inferiority test to evaluate four hypotheses about the interaction between translation and embedding choice, producing language-specific verdicts.
  • The results distinguish languages where translation demonstrably preserves semantic structure from those where it demonstrably distorts it, and leave unresolved cases where evidence is insufficient.
  • The proposed framework is designed to be agnostic to the specific corpus and translation/embedding pipeline and can be extended to downstream applications.
  • Analysis of the dataset finds 10 languages showing translation invariance and 4 languages showing detectable distortion.

Abstract

We investigate the extent to which cosine similarity between paragraph embeddings is invariant under machine translation, using the Manifesto Corpus of over 2,800 political party platforms in 28 languages translated into English via the EU eTranslation service. Rather than measuring translation-induced semantic shift directly, we measure the stability of pairwise similarity relationships across embedding models, and use inter-model disagreement on original-language text as a calibrated invariance threshold. This yields a per-language non-inferiority test for four hypotheses about how translation interacts with embedding choice, with verdicts that distinguish languages where translation demonstrably preserves semantic structure from those where it demonstrably degrades it and from those where the available evidence does not resolve the question. The framework is corpus- and pipeline-agnostic and extends naturally to downstream tasks. Applied to our data, it identifies ten languages with translation invariance and four with detectable distortion.
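The calibration idea described above can be sketched in a few lines: compare how much pairwise cosine similarities shift under translation against how much they already shift between two embedding models on the original-language text, then run a one-sided (non-inferiority) test on the paired difference. This is a minimal illustration, not the paper's implementation; the function names, the paired t-test form, and the synthetic data below are all assumptions.

```python
import numpy as np
from scipy import stats

def pairwise_cosine(X):
    """Pairwise cosine-similarity matrix for row-vector embeddings."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

def upper_triangle(S):
    """Flatten the strict upper triangle (each paragraph pair once)."""
    i, j = np.triu_indices_from(S, k=1)
    return S[i, j]

def noninferiority_pvalue(orig_A, orig_B, trans_A, margin=0.0):
    """One-sided test that the translation-induced similarity shift is
    no larger than the inter-model disagreement (plus a margin).

    orig_A, orig_B: embeddings of the original-language paragraphs under
    two different models; trans_A: embeddings of the translations under
    model A. A small p-value supports translation invariance (relative
    to the calibrated threshold). Illustrative sketch only.
    """
    model_gap = np.abs(upper_triangle(pairwise_cosine(orig_A))
                       - upper_triangle(pairwise_cosine(orig_B)))
    trans_gap = np.abs(upper_triangle(pairwise_cosine(orig_A))
                       - upper_triangle(pairwise_cosine(trans_A)))
    diff = trans_gap - model_gap  # paired per paragraph pair
    # H0: mean(diff) >= margin (translation shifts similarities more
    # than model choice does); rejecting H0 => non-inferior.
    return stats.ttest_1samp(diff, margin, alternative="less").pvalue
```

In this toy setup, per-language verdicts would come from running the test once per language and partitioning languages by whether the test rejects (invariance), an analogous test in the other direction rejects (distortion), or neither does (unresolved).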