On the Geometry of Positional Encodings in Transformers

arXiv cs.LG / 4/8/2026


Key Points

  • The paper argues that positional encodings need a principled mathematical theory rather than trial-and-error design, and develops such a framework for Transformers.
  • It proves that Transformers lacking any positional signal cannot solve tasks whose outcomes depend on word order (Necessity Theorem).
  • Under mild, verifiable conditions, it shows that training yields distinct vector representations for different sequence positions at every global minimizer (Positional Separation Theorem).
  • It formulates an information-optimal encoding objective by constructing an embedding via classical multidimensional scaling (MDS) on the Hellinger distances between positional distributions, using "stress" as a single quality metric for any encoding.
  • The work shows that the optimal encoding has effective rank r ≤ n−1 and can be represented parameter-efficiently; experiments suggest that ALiBi achieves much lower stress than sinusoidal encodings and RoPE, consistent with a rank-1 interpretation of the MDS encoding.
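The MDS construction in the third bullet can be sketched numerically. The snippet below is a minimal illustration, not the paper's implementation: it assumes toy positional distributions (random Dirichlet draws standing in for the paper's true positional distributions), computes pairwise Hellinger distances, embeds them with classical (Torgerson) MDS, and reports the normalised stress.

```python
import numpy as np

def hellinger(p, q):
    # Hellinger distance between two discrete probability distributions.
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def classical_mds(D, d):
    # Classical (Torgerson) MDS: double-centre the squared distances,
    # then embed via the top-d eigenpairs of B = -0.5 * J D^2 J.
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:d]      # largest eigenvalues first
    L = np.maximum(vals[idx], 0.0)        # clip tiny negative eigenvalues
    return vecs[:, idx] * np.sqrt(L)      # n x d positional embedding

def stress(D, X):
    # Normalised stress: mismatch between target and embedded distances.
    E = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return np.sqrt(np.sum((D - E) ** 2) / np.sum(D ** 2))

# Toy setup: n positions, each with a hypothetical distribution over a
# small vocabulary (rows of P). These are illustrative assumptions only.
rng = np.random.default_rng(0)
n, v, d = 8, 16, 3
P = rng.dirichlet(np.ones(v), size=n)
D = np.array([[hellinger(P[i], P[j]) for j in range(n)] for i in range(n)])
X = classical_mds(D, d)
print("stress:", stress(D, X))
```

A lower stress means the embedding's Euclidean geometry more faithfully reproduces the Hellinger geometry of the positional distributions, which is the sense in which the paper scores encodings with a single number.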

Abstract

Neural language models process sequences of words, but the mathematical operations inside them are insensitive to the order in which words appear. Positional encodings are the component added to remedy this. Despite their importance, positional encodings have been designed largely by trial and error, without a mathematical theory of what they ought to do. This paper develops such a theory. Four results are established. First, any Transformer without a positional signal cannot solve any task sensitive to word order (Necessity Theorem). Second, training assigns distinct vector representations to distinct sequence positions at every global minimiser, under mild and verifiable conditions (Positional Separation Theorem). Third, the best achievable approximation to an information-optimal encoding is constructed via classical multidimensional scaling (MDS) on the Hellinger distance between positional distributions; the quality of any encoding is measured by a single number, the stress (Proposition 5, Algorithm 1). Fourth, the optimal encoding has effective rank r = rank(B) ≤ n−1 and can be represented with r(n+d) parameters instead of nd (minimal parametrisation result). Appendix A develops a proof of the Monotonicity Conjecture within the Neural Tangent Kernel (NTK) regime for masked language modelling (MLM) losses, sequence classification losses, and general losses satisfying a positional sufficiency condition, through five lemmas. Experiments on SST-2 and IMDB with BERT-base confirm the theoretical predictions and reveal that Attention with Linear Biases (ALiBi) achieves much lower stress than the sinusoidal encoding and Rotary Position Embedding (RoPE), consistent with a rank-1 interpretation of the MDS encoding under approximate shift-equivariance.
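The Necessity Theorem rests on a standard observation that is easy to verify directly: self-attention without any positional signal is permutation-equivariant, so reordering the input tokens merely reorders the outputs. The sketch below demonstrates this for a single attention head with randomly drawn weights (all names and dimensions here are illustrative assumptions, not the paper's setup).

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Single-head self-attention with no positional signal.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[1]))
    return A @ V

rng = np.random.default_rng(0)
n, d = 6, 8
X = rng.standard_normal((n, d))                      # token embeddings only
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

perm = rng.permutation(n)
out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)

# Permuting the input rows just permutes the output rows: the layer sees
# a bag of tokens, so no order-sensitive task can be solved without a
# positional signal added to X.
print(np.allclose(out_perm, out[perm]))  # True
```

Because every output row depends only on the multiset of input tokens, any function computed by a stack of such layers is invariant (up to the same reordering) under permutations of the input, which is exactly what the Necessity Theorem rules out for order-sensitive tasks.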