Is Textual Similarity Invariant under Machine Translation? Evidence Based on the Political Manifesto Corpus
arXiv cs.CL / 5/4/2026
Key Points
- The study tests whether cosine similarity between paragraph embeddings is preserved under machine translation, using over 2,800 political party platforms spanning 28 languages and translated into English with EU eTranslation.
- Instead of directly measuring the semantic shift introduced by translation, the researchers assess how stable pairwise similarity relationships are across different embedding models, and they calibrate an invariance threshold from the models' disagreement on the original-language text.
- They formulate a per-language non-inferiority test to evaluate four hypotheses about the interaction between translation and embedding choice, producing language-specific verdicts.
- The results distinguish languages where translation demonstrably preserves semantic structure from those where it demonstrably distorts it, and leave unresolved cases where evidence is insufficient.
- The proposed framework is designed to be agnostic to the specific corpus and translation/embedding pipeline and can be extended to downstream applications.
- Analysis of the dataset finds 10 languages showing translation invariance and 4 languages showing detectable distortion.
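
The core idea in the points above can be illustrated with a minimal sketch. It is not the authors' implementation: the disagreement measure (one minus a Spearman rank correlation between two models' pairwise cosine similarities), the function names, and the `margin` parameter are all assumptions chosen for illustration. The comparison at the end is a simple point-estimate check against a margin; a full non-inferiority test would compare a confidence bound rather than a single estimate.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of `embeddings`."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit @ unit.T


def upper_triangle(matrix):
    """Flatten the strict upper triangle so each unordered pair appears once."""
    rows, cols = np.triu_indices(matrix.shape[0], k=1)
    return matrix[rows, cols]


def ordinal_ranks(values):
    """Ordinal ranks; ties are broken by position (adequate for continuous data)."""
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks


def disagreement(emb_a, emb_b):
    """Assumed measure: 1 - Spearman correlation of two models' pairwise similarities."""
    sim_a = upper_triangle(cosine_similarity_matrix(emb_a))
    sim_b = upper_triangle(cosine_similarity_matrix(emb_b))
    rho = np.corrcoef(ordinal_ranks(sim_a), ordinal_ranks(sim_b))[0, 1]
    return 1.0 - rho


def within_margin(baseline, translated, margin):
    """Hypothetical rule: translation adds at most `margin` extra disagreement."""
    return (translated - baseline) <= margin


if __name__ == "__main__":
    # Synthetic stand-ins for embeddings; no real corpus or model is used here.
    rng = np.random.default_rng(0)
    latent = rng.normal(size=(50, 16))

    def noisy_view(scale):
        # Imitates one embedding model's view of the same paragraphs.
        return latent + scale * rng.normal(size=latent.shape)

    baseline = disagreement(noisy_view(0.1), noisy_view(0.1))    # original text
    translated = disagreement(noisy_view(0.2), noisy_view(0.2))  # translated text
    print(baseline, translated, within_margin(baseline, translated, margin=0.05))
```

In this sketch the baseline disagreement between models on the original-language text plays the role of the calibration step: translation is judged against how much the models already disagree before any translation occurs.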