SigGate-GT: Taming Over-Smoothing in Graph Transformers via Sigmoid-Gated Attention

arXiv cs.LG / 4/21/2026


Key Points

  • Graph transformers can suffer from over-smoothing and attention entropy degeneration, which the paper links to attention-sink behavior caused by softmax’s sum-to-one constraint.
  • SigGate-GT introduces per-head learned sigmoid gates that can suppress uninformative attention outputs inside the GraphGPS graph transformer framework.
  • Experiments on five benchmarks show SigGate-GT matches the previous best on ZINC and achieves a new state of the art on ogbg-molhiv (82.47% ROC-AUC), with statistically significant improvements over GraphGPS on all datasets.
  • Ablation results indicate the gating strategy reduces over-smoothing by 30%, increases attention entropy, and improves training stability across a 10× learning-rate range, with only ~1% parameter overhead on OGB.

Abstract

Graph transformers achieve strong results on molecular and long-range reasoning tasks, yet remain hampered by over-smoothing (the progressive collapse of node representations with depth) and attention entropy degeneration. We observe that these pathologies share a root cause with attention sinks in large language models: softmax attention's sum-to-one constraint forces every node to attend somewhere, even when no informative signal exists. Motivated by recent findings that element-wise sigmoid gating eliminates attention sinks in large language models, we propose SigGate-GT, a graph transformer that applies learned, per-head sigmoid gates to the attention output within the GraphGPS framework. Each gate can suppress activations toward zero, enabling heads to selectively silence uninformative connections. On five standard benchmarks, SigGate-GT matches the prior best on ZINC (0.059 MAE) and sets a new state of the art on ogbg-molhiv (82.47% ROC-AUC), with statistically significant gains over GraphGPS across all five datasets (p < 0.05). Ablations show that gating reduces over-smoothing by 30% (mean relative MAD gain across 4–16 layers), increases attention entropy, and stabilizes training across a 10× learning rate range, with about 1% parameter overhead on OGB.
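To make the mechanism concrete, here is a minimal NumPy sketch of one attention head with an output-side sigmoid gate, in the spirit the abstract describes. The exact gate parameterization (an input-dependent linear projection `Wg`, `bg` followed by a sigmoid) is an assumption for illustration, not the paper's verified implementation; the key property shown is that a gate near zero lets a head silence its output for a node, sidestepping the softmax sum-to-one constraint.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax: rows sum to one by construction
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_attention_head(X, Wq, Wk, Wv, Wg, bg):
    """One attention head over node features X (n_nodes, d_in),
    with an element-wise sigmoid gate on the head's output.
    Gate parameterization (Wg, bg) is a hypothetical choice."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    # softmax forces every node to attend somewhere (the "sink" issue)
    A = softmax(Q @ K.T / np.sqrt(d))
    out = A @ V
    # gate in (0, 1): values near 0 suppress the head's output for a node
    g = sigmoid(X @ Wg + bg)
    return g * out

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                       # 5 nodes, 8 features
Wq, Wk, Wv, Wg = (rng.normal(size=(8, 4)) for _ in range(4))
y = gated_attention_head(X, Wq, Wk, Wv, Wg, np.zeros(4))
```

With a strongly negative gate bias, the sigmoid saturates near zero and the head's output is suppressed almost entirely, which is the "selective silencing" behavior the abstract attributes to the gates.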