On the Expressive Power of Contextual Relations in Transformers

arXiv cs.LG · March 30, 2026


Key Points

  • The paper argues that while Transformers model contextual relationships well empirically, their expressive power is not fully characterized mathematically.
  • It proposes a measure-theoretic framework where texts are probability measures in a semantic embedding space and contextual relations are represented using coupling measures.
  • The authors introduce the “Sinkhorn Transformer,” a transformer-like architecture designed for this coupling-measure setting.
  • The main contribution is a universal approximation theorem showing that continuous coupling functions between probability measures can be uniformly approximated by a Sinkhorn Transformer with suitable parameters.

Abstract

Transformer architectures have achieved remarkable empirical success in modeling contextual relationships in natural language, yet a precise mathematical characterization of their expressive power remains incomplete. In this work, we introduce a measure-theoretic framework for contextual representations in which texts are modeled as probability measures over a semantic embedding space, and contextual relations between words are represented as coupling measures between them. Within this setting, we introduce the Sinkhorn Transformer, a transformer-like architecture. Our main result is a universal approximation theorem: any continuous coupling function between probability measures (that is, any function encoding the semantic relation as a coupling measure) can be uniformly approximated by a Sinkhorn Transformer with appropriate parameters.
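To make the coupling-measure idea concrete, here is a minimal sketch of the classical Sinkhorn-Knopp iteration, which computes an entropically regularized coupling between two discrete probability measures. This is the standard algorithm the architecture's name alludes to, not the paper's Sinkhorn Transformer itself; the toy "texts," the cost matrix, and all parameter values below are illustrative assumptions.

```python
import numpy as np

def sinkhorn_coupling(a, b, cost, reg=0.5, n_iter=1000):
    """Entropic-OT coupling between discrete measures a and b (Sinkhorn-Knopp).

    Returns a matrix P >= 0 whose row sums approximate a and whose
    column sums approximate b, i.e. a coupling of the two measures.
    """
    K = np.exp(-cost / reg)                # Gibbs kernel from the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)                  # rescale columns toward marginal b
        u = a / (K @ v)                    # rescale rows toward marginal a
    return u[:, None] * K * v[None, :]     # coupling P = diag(u) K diag(v)

# Two toy "texts" as uniform measures over 1-D word embeddings (illustrative).
x = np.array([[0.0], [1.0], [2.0]])        # 3 embedded words in text 1
y = np.array([[0.5], [1.5]])               # 2 embedded words in text 2
a = np.full(3, 1 / 3)                      # uniform weights on text 1
b = np.full(2, 1 / 2)                      # uniform weights on text 2
C = (x - y.T) ** 2                         # squared-distance cost, shape (3, 2)

P = sinkhorn_coupling(a, b, C)
print(P.sum(axis=1))                       # approximately a
print(P.sum(axis=0))                       # approximately b
```

Each entry `P[i, j]` can be read as how strongly word `i` of one text relates to word `j` of the other; the regularization strength `reg` controls how diffuse that relation is.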