Attention Sinks in Massively Multilingual Neural Machine Translation: Discovery, Analysis, and Mitigation

arXiv cs.LG / 5/5/2026


Key Points

  • The paper identifies a systematic artifact in cross-attention analysis for the NLLB-200 multilingual NMT model: “attention sinks” where non-content tokens (EOS tokens, language tags, and punctuation) absorb 83%–91% of total cross-attention mass (see the measurement sketch after this list).
  • Because these sinks skew attention distributions, raw cross-attention metrics can severely underestimate content-level similarity by nearly half (36.7% raw vs. 70.7% after filtering), making many uncorrected interpretability studies unreliable.
  • The authors trace the effect to a causal mechanism rooted in vocabulary design rather than positional bias, extending prior LLM attention-sink findings to NMT.
  • They validate a content-only filtering and renormalization method, showing the artifact is universal across African and non-African language benchmarks and that corrected analyses recover meaningful signals (mode gaps, language-family clustering, and a “Somali paradox”).
  • The study releases a filtering toolkit and corrected datasets to enable reproducible, more trustworthy interpretability research for multilingual NMT.
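
The headline figures above are fractions of cross-attention mass landing on non-content source tokens. Below is a minimal sketch of how such a sink-mass fraction could be computed; the `is_non_content` heuristic (EOS/pad tokens, NLLB-style language tags such as `eng_Latn`, and pure punctuation) and the tensor layout are illustrative assumptions, not the paper's released toolkit.

```python
import string
import numpy as np

def is_non_content(token: str) -> bool:
    """Heuristic for 'sink' tokens: EOS/pad, NLLB language tags, punctuation.

    Illustrative assumption; the paper's exact token categories may differ.
    """
    if token in {"</s>", "<pad>"}:
        return True
    if len(token) == 8 and token[3] == "_":  # crude match for tags like "eng_Latn"
        return True
    stripped = token.strip("\u2581")  # drop the SentencePiece word-boundary marker
    return bool(stripped) and all(ch in string.punctuation for ch in stripped)

def sink_mass(cross_attn: np.ndarray, src_tokens: list[str]) -> float:
    """Fraction of cross-attention mass absorbed by non-content source tokens.

    cross_attn: (layers, heads, tgt_len, src_len); each row over src_len is a
    softmax distribution summing to 1.
    """
    mask = np.array([is_non_content(t) for t in src_tokens])
    # Sum the mass each target position sends to non-content source positions,
    # then average over layers, heads, and target positions.
    return float(cross_attn[..., mask].sum(axis=-1).mean())
```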

Abstract

Cross-attention patterns in neural machine translation (NMT) are widely used to study how multilingual models align linguistic structure. We report a systematic artifact in cross-attention analysis of NLLB-200 (600M): non-content tokens (primarily end-of-sequence tokens, language tags, and punctuation) capture 83 percent to 91 percent of total cross-attention mass. We term these "attention sinks," extending findings from LLMs [Xiao et al., 2023] to NMT cross-attention and identifying a causal mechanism rooted in vocabulary design rather than position bias. This artifact causes raw metrics to underestimate content-level similarity by nearly half (36.7 percent raw vs. 70.7 percent filtered), rendering uncorrected analyses unreliable. To address this, we validate a content-only filtering methodology that removes non-content tokens and renormalizes the distribution. Applying this to 1,000 parallel sentences across African languages (Swahili, Kikuyu, Somali, Luo) and non-African benchmarks (German, Turkish, Chinese, Hindi), we confirm the artifact is universal and recover masked linguistic signals: a 16.9 percentage-point gap between teacher-forcing and generation modes, clear language-family clustering in attention entropy, and a hidden Somali paradox linking SOV word order to monotonic alignment. We release our filtering toolkit and corrected datasets to support reproducible interpretability research on multilingual NMT.
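
The correction described in the abstract (drop non-content source positions, then renormalize each target token's distribution over the remaining content tokens) can be sketched as follows, reusing the hypothetical `is_non_content` helper from the earlier sketch. The entropy helper shows the kind of per-row attention entropy used for the language-family clustering analysis, though the paper's released toolkit may compute it differently.

```python
import numpy as np

def filter_and_renormalize(cross_attn: np.ndarray,
                           src_tokens: list[str]) -> np.ndarray:
    """Content-only filtering: drop sink columns, renormalize the rest.

    cross_attn: (layers, heads, tgt_len, src_len) with rows summing to 1.
    Returns (layers, heads, tgt_len, n_content) with rows re-summing to 1.
    """
    # is_non_content is the hypothetical heuristic from the earlier sketch.
    keep = np.array([not is_non_content(t) for t in src_tokens])
    filtered = cross_attn[..., keep]
    return filtered / filtered.sum(axis=-1, keepdims=True)

def mean_attention_entropy(attn: np.ndarray) -> float:
    """Mean Shannon entropy (in bits) of per-target attention distributions."""
    p = np.clip(attn, 1e-12, 1.0)
    return float(-(p * np.log2(p)).sum(axis=-1).mean())
```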