Expressivity of Transformers: A Tropical Geometry Perspective

arXiv cs.LG / April 17, 2026


Key Points

  • The paper introduces a tropical geometry framework to measure how transformers partition space, modeling self-attention as a vector-valued tropical rational map.
  • It proves that, in the zero-temperature limit, the attention evaluation corresponds exactly to a Power Voronoi Diagram, linking transformer expressivity to a concrete geometric object.
  • Using this equivalence, the authors give a combinatorial explanation for why multi-head self-attention (MHSA) works: Minkowski sums of Newton polytopes raise the attainable polyhedral complexity to O(N^H), past the O(N) ceiling of a single head.
  • For deep transformer architectures, the study derives tight asymptotic bounds on the number of linear regions, showing a growth of Θ(N^{d_model L}) and attributing it to sequence length, embedding dimension, and depth.
  • The theory also covers robustness: the idealized polyhedral partition remains stable under finite-temperature (soft) attention, backed by exponentially tight differential approximation bounds.
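The zero-temperature and stability claims above can be illustrated with a small numerical sketch. This is not the paper's construction, just a minimal NumPy toy: the per-key bias `b` is a hypothetical stand-in for the weights that make the cells a power diagram rather than an ordinary Voronoi diagram.

```python
import numpy as np

# Toy setup: N keys/values in d dimensions. The keys act as the "sites"
# of the diagram; the per-key additive bias b_i is what turns the cells
# into a *power* diagram rather than an ordinary Voronoi one.
rng = np.random.default_rng(0)
d, N = 2, 5
K = rng.normal(size=(N, d))
V = rng.normal(size=(N, d))
b = rng.normal(size=N)  # hypothetical per-key bias (power-diagram weights)

def soft_attention(q, temperature):
    """Standard softmax attention at a given temperature."""
    scores = (K @ q + b) / temperature
    w = np.exp(scores - scores.max())  # shift for numerical stability
    w /= w.sum()
    return w @ V

def hard_attention(q):
    """Zero-temperature limit: return the value of the max-score key,
    i.e. the value attached to the diagram cell containing q."""
    return V[np.argmax(K @ q + b)]

q = rng.normal(size=d)
soft = soft_attention(q, temperature=1e-3)
hard = hard_attention(q)
# At low temperature the two outputs coincide up to an exponentially
# small gap, mirroring the stability guarantee summarized above.
```

As the temperature is raised, `soft_attention` blurs the cell boundaries but, away from them, stays exponentially close to the hard assignment.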

Abstract

To quantify the geometric expressivity of transformers, we introduce a tropical geometry framework to characterize their exact spatial partitioning capabilities. By modeling self-attention as a vector-valued tropical rational map, we prove it evaluates exactly to a Power Voronoi Diagram in the zero-temperature limit. Building on this equivalence, we establish a combinatorial rationale for Multi-Head Self-Attention (MHSA): via the Minkowski sum of Newton polytopes, multi-head aggregation expands the polyhedral complexity to O(N^H), overcoming the O(N) bottleneck of single heads. Extending this to deep architectures, we derive the first tight asymptotic bounds on the number of linear regions in transformers (Θ(N^{d_model L})), demonstrating a combinatorial explosion driven intrinsically by sequence length N, ambient embedding dimension d_model, and network depth L. Importantly, we guarantee that this idealized polyhedral skeleton is geometrically stable: finite-temperature soft attention preserves these topological partitions via exponentially tight differential approximation bounds.
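The multi-head claim can also be seen numerically. In the zero-temperature limit each head assigns a query to one cell of its own diagram, so the joint "region id" is the tuple of per-head argmax indices, giving up to N^H joint regions. The sketch below is an illustrative toy (random keys, 2-D queries, no biases), not the paper's counting argument; in two dimensions the attainable counts are far below N^H, but the joint partition still refines every single head's partition.

```python
import numpy as np

# H heads, each with its own key matrix. At zero temperature a head maps
# a query q to argmax_i <k_i, q>, i.e. to one cell of that head's diagram.
rng = np.random.default_rng(1)
d, N, H = 2, 4, 3
Ks = rng.normal(size=(H, N, d))

def region_id(q):
    """Joint region of q: the tuple of per-head winning key indices."""
    return tuple(int(np.argmax(Ks[h] @ q)) for h in range(H))

# Sample queries densely and compare the number of distinct joint regions
# against the regions any single head produces on the same samples.
qs = rng.normal(size=(20000, d))
joint = {region_id(q) for q in qs}
single = [len({int(np.argmax(Ks[h] @ q)) for q in qs}) for h in range(H)]
print(len(joint), single)
```

The joint count is always bounded by N**H and never smaller than any single head's count, which is the combinatorial picture behind the O(N^H) versus O(N) comparison in the abstract.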