Is Textual Similarity Invariant under Machine Translation? Evidence Based on the Political Manifesto Corpus

arXiv cs.CL / 5/4/2026

📰 News · Models & Research

Key Points

  • The study tests whether cosine similarity between paragraph embeddings remains unchanged after machine translation, using over 2,800 political party platforms in 28 languages, translated into English via the EU eTranslation service.
  • Instead of directly measuring semantic shift from translation, the researchers assess how stable pairwise similarity relationships are across different embedding models and calibrate an invariance threshold using disagreement on the original-language text.
  • They formulate a per-language, non-inferiority test to evaluate four hypotheses about the interaction between translation and embedding choice, producing language-specific verdicts.
  • The results distinguish languages where translation demonstrably preserves semantic structure from those where it demonstrably distorts it, and leave unresolved cases where evidence is insufficient.
  • The proposed framework is designed to be agnostic to the specific corpus and translation/embedding pipeline and can be extended to downstream applications.
  • Analysis of the dataset finds 10 languages showing translation invariance and 4 languages showing detectable distortion.

Abstract

We investigate the extent to which cosine similarity between paragraph embeddings is invariant under machine translation, using the Manifesto Corpus of over 2,800 political party platforms in 28 languages translated into English via the EU eTranslation service. Rather than measuring translation-induced semantic shift directly, we measure the stability of pairwise similarity relationships across embedding models, and use inter-model disagreement on original-language text as a calibrated invariance threshold. This yields a per-language non-inferiority test for four hypotheses about how translation interacts with embedding choice, with verdicts that distinguish languages where translation demonstrably preserves semantic structure from those where it demonstrably degrades it and from those where the available evidence does not resolve the question. The framework is corpus- and pipeline-agnostic and extends naturally to downstream tasks. Applied to our data, it identifies ten languages with translation invariance and four with detectable distortion.
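The calibration idea described above can be sketched in a few lines: compare how much pairwise cosine similarities shift under translation against how much they already shift between two embedding models on the original-language text, then run a one-sided (non-inferiority) test on the paired difference. This is a minimal illustration, not the paper's implementation; the function names, the paired t-test form, and the synthetic data below are all assumptions.

```python
import numpy as np
from scipy import stats

def pairwise_cosine(X):
    """Pairwise cosine-similarity matrix for row-vector embeddings."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

def upper_triangle(S):
    """Flatten the strict upper triangle (each paragraph pair once)."""
    i, j = np.triu_indices_from(S, k=1)
    return S[i, j]

def noninferiority_pvalue(orig_A, orig_B, trans_A, margin=0.0):
    """One-sided test that the translation-induced similarity shift is
    no larger than the inter-model disagreement (plus a margin).

    orig_A, orig_B: embeddings of the original-language paragraphs under
    two different models; trans_A: embeddings of the translations under
    model A. A small p-value supports translation invariance (relative
    to the calibrated threshold). Illustrative sketch only.
    """
    model_gap = np.abs(upper_triangle(pairwise_cosine(orig_A))
                       - upper_triangle(pairwise_cosine(orig_B)))
    trans_gap = np.abs(upper_triangle(pairwise_cosine(orig_A))
                       - upper_triangle(pairwise_cosine(trans_A)))
    diff = trans_gap - model_gap  # paired per paragraph pair
    # H0: mean(diff) >= margin (translation shifts similarities more
    # than model choice does); rejecting H0 => non-inferior.
    return stats.ttest_1samp(diff, margin, alternative="less").pvalue
```

In this toy setup, per-language verdicts would come from running the test once per language and partitioning languages by whether the test rejects (invariance), an analogous test in the other direction rejects (distortion), or neither does (unresolved).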