Graph Memory Transformer (GMT)

arXiv cs.LG / 4/28/2026

Key Points

  • The Graph Memory Transformer (GMT) explores replacing the FFN sublayer in a decoder-only transformer with an explicit learned memory graph while keeping causal self-attention and the autoregressive decoder structure.
  • GMT routes token representations over a learned bank of centroids connected by a learned directed transition matrix, so each memory cell outputs a “movement” from a source memory state toward a target memory state rather than a retrieved value (a hedged code sketch follows this list).
  • In the studied base GMT v7 configuration, each of 16 transformer blocks uses 128 centroids and associated edge/transition structures, with a gated displacement readout that enables direct inspection of centroid usage and transition behavior.
  • The base GMT v7 is an 82.2M-parameter decoder-only language model without dense FFN sublayers, but it underperforms the 103.0M-parameter dense GPT-style baseline on validation loss and perplexity.
  • The authors emphasize that results are not a state-of-the-art claim and position GMT as evidence that graph-mediated memory navigation can make within-token transformations more structurally interpretable, with scaling and broader evaluation left for future work.
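For readers who want the mechanism in code, here is a minimal sketch of the memory cell described above (and in the abstract below): a centroid bank, a directed edge matrix, a soft “gravitational” source assignment, token-conditioned target selection, and a gated displacement readout. The module and parameter names (`GraphMemoryCell`, `target_proj`, `gate_proj`), initializations, and exact routing formulas are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphMemoryCell(nn.Module):
    """Illustrative sketch of a graph-memory cell standing in for the FFN.

    The centroid bank, directed edge matrix, "gravitational" source routing,
    token-conditioned target selection, and gated displacement readout follow
    the paper's description, but every projection, initialization, and formula
    here is an assumption, not the authors' code.
    """

    def __init__(self, d_model: int, n_centroids: int = 128):
        super().__init__()
        # Learned bank of memory centroids (one bank per transformer block).
        self.centroids = nn.Parameter(torch.randn(n_centroids, d_model) * 0.02)
        # Learned directed transition ("edge") matrix over centroid pairs.
        self.edges = nn.Parameter(torch.zeros(n_centroids, n_centroids))
        # Token-conditioned bias for target selection (assumed form).
        self.target_proj = nn.Linear(d_model, n_centroids)
        # Gate controlling how much of the displacement is emitted.
        self.gate_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)

        # 1) "Gravitational" source routing: soft assignment of each token to
        #    centroids, with weight falling off with squared distance.
        dist2 = (x.unsqueeze(-2) - self.centroids).pow(2).sum(-1)   # (B, T, C)
        src_w = F.softmax(-dist2, dim=-1)
        src_state = src_w @ self.centroids                          # estimated source state

        # 2) Token-conditioned target selection: outgoing edge weights from the
        #    soft source, biased by a token-dependent score.
        edge_logits = src_w @ self.edges                            # (B, T, C)
        tgt_w = F.softmax(edge_logits + self.target_proj(x), dim=-1)
        tgt_state = tgt_w @ self.centroids                          # target state

        # 3) Gated displacement readout: return a movement from source toward
        #    target rather than a retrieved value.
        gate = torch.sigmoid(self.gate_proj(x))
        return gate * (tgt_state - src_state)
```

In a GMT block this cell would take the place of the FFN sublayer of a pre-norm decoder block, roughly `x = x + cell(layer_norm(x))` after the causal self-attention sublayer; the soft assignments `src_w` and `tgt_w` are what make centroid usage and transition behavior directly inspectable.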

Abstract

We investigate whether the Feed-Forward Network (FFN) sublayer in a decoder-only transformer can be replaced by an explicit learned memory graph while preserving the surrounding autoregressive architecture. The proposed Graph Memory Transformer (GMT) keeps causal self-attention intact but replaces the usual per-token FFN transformation with a memory cell that routes token representations over a learned bank of centroids connected by a learned directed transition matrix. In the base GMT v7 instantiation studied here, each of 16 transformer blocks contains 128 centroids, a 128 × 128 edge matrix, gravitational source routing, token-conditioned target selection, and a gated displacement readout. The cell therefore returns a movement from an estimated source memory state toward a target memory state, rather than a retrieved value. The resulting model is a fully decoder-only language model with 82.2M trainable parameters and no dense FFN sublayers, compared with a 103.0M-parameter dense GPT-style baseline used in the evaluation. The base v7 model trains stably and exposes centroid usage, transition structure, and source-to-target movement as directly inspectable quantities of the forward computation. It remains behind the larger dense baseline in validation loss and perplexity (3.5995/36.58 vs. 3.2903/26.85), while showing comparable zero-shot benchmark behavior under the evaluated setting. These results are not intended as a state-of-the-art claim; they support the viability and structural interpretability of replacing dense within-token transformation with graph-mediated memory navigation. Broader scaling, optimized kernels, and more extensive benchmark evaluation are left for subsequent work.
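A brief consistency note on the reported figures: assuming the perplexities are the exponential of the validation cross-entropy loss (the usual convention), the two loss/perplexity pairs line up:

$$\mathrm{PPL} = e^{\mathcal{L}_{\text{val}}}:\qquad e^{3.5995} \approx 36.58,\qquad e^{3.2903} \approx 26.85.$$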