Expressivity of Transformers: A Tropical Geometry Perspective

arXiv cs.LG / April 17, 2026


Key Points

  • The paper introduces a tropical geometry framework to measure how transformers partition space, modeling self-attention as a vector-valued tropical rational map.
  • It proves that, in the zero-temperature limit, the attention evaluation corresponds exactly to a Power Voronoi Diagram, linking transformer expressivity to a concrete geometric object.
  • Using this equivalence, the authors give a combinatorial explanation for why multi-head self-attention (MHSA) works: Minkowski sums of Newton polytopes raise the attainable polyhedral complexity to O(N^H), past the O(N) ceiling of a single head.
  • For deep transformer architectures, the study derives tight asymptotic bounds on the number of linear regions, showing a growth of Θ(N^{d_model L}) and attributing it to sequence length, embedding dimension, and depth.
  • The theory also covers robustness: the idealized polyhedral partition remains stable under finite-temperature (soft) attention, backed by exponentially tight differential approximation bounds.
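The zero-temperature and stability claims above can be illustrated with a small numerical sketch. This is not the paper's construction, just a minimal NumPy toy: the per-key bias `b` is a hypothetical stand-in for the weights that make the cells a power diagram rather than an ordinary Voronoi diagram.

```python
import numpy as np

# Toy setup: N keys/values in d dimensions. The keys act as the "sites"
# of the diagram; the per-key additive bias b_i is what turns the cells
# into a *power* diagram rather than an ordinary Voronoi one.
rng = np.random.default_rng(0)
d, N = 2, 5
K = rng.normal(size=(N, d))
V = rng.normal(size=(N, d))
b = rng.normal(size=N)  # hypothetical per-key bias (power-diagram weights)

def soft_attention(q, temperature):
    """Standard softmax attention at a given temperature."""
    scores = (K @ q + b) / temperature
    w = np.exp(scores - scores.max())  # shift for numerical stability
    w /= w.sum()
    return w @ V

def hard_attention(q):
    """Zero-temperature limit: return the value of the max-score key,
    i.e. the value attached to the diagram cell containing q."""
    return V[np.argmax(K @ q + b)]

q = rng.normal(size=d)
soft = soft_attention(q, temperature=1e-3)
hard = hard_attention(q)
# At low temperature the two outputs coincide up to an exponentially
# small gap, mirroring the stability guarantee summarized above.
```

As the temperature is raised, `soft_attention` blurs the cell boundaries but, away from them, stays exponentially close to the hard assignment.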

Abstract

To quantify the geometric expressivity of transformers, we introduce a tropical geometry framework to characterize their exact spatial partitioning capabilities. By modeling self-attention as a vector-valued tropical rational map, we prove it evaluates exactly to a Power Voronoi Diagram in the zero-temperature limit. Building on this equivalence, we establish a combinatorial rationale for Multi-Head Self-Attention (MHSA): via the Minkowski sum of Newton polytopes, multi-head aggregation expands the polyhedral complexity to O(N^H), overcoming the O(N) bottleneck of single heads. Extending this to deep architectures, we derive the first tight asymptotic bounds on the number of linear regions in transformers (Θ(N^{d_model L})), demonstrating a combinatorial explosion driven intrinsically by sequence length N, ambient embedding dimension d_model, and network depth L. Importantly, we guarantee that this idealized polyhedral skeleton is geometrically stable: finite-temperature soft attention preserves these topological partitions via exponentially tight differential approximation bounds.
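The multi-head claim can also be seen numerically. In the zero-temperature limit each head assigns a query to one cell of its own diagram, so the joint "region id" is the tuple of per-head argmax indices, giving up to N^H joint regions. The sketch below is an illustrative toy (random keys, 2-D queries, no biases), not the paper's counting argument; in two dimensions the attainable counts are far below N^H, but the joint partition still refines every single head's partition.

```python
import numpy as np

# H heads, each with its own key matrix. At zero temperature a head maps
# a query q to argmax_i <k_i, q>, i.e. to one cell of that head's diagram.
rng = np.random.default_rng(1)
d, N, H = 2, 4, 3
Ks = rng.normal(size=(H, N, d))

def region_id(q):
    """Joint region of q: the tuple of per-head winning key indices."""
    return tuple(int(np.argmax(Ks[h] @ q)) for h in range(H))

# Sample queries densely and compare the number of distinct joint regions
# against the regions any single head produces on the same samples.
qs = rng.normal(size=(20000, d))
joint = {region_id(q) for q in qs}
single = [len({int(np.argmax(Ks[h] @ q)) for q in qs}) for h in range(H)]
print(len(joint), single)
```

The joint count is always bounded by N**H and never smaller than any single head's count, which is the combinatorial picture behind the O(N^H) versus O(N) comparison in the abstract.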