Gated Subspace Inference for Transformer Acceleration

arXiv cs.LG / 5/6/2026


Key Points

  • The paper proposes a transformer inference acceleration method that leverages the low effective rank of token activation manifolds at each layer.
  • It splits each token activation into a low-dimensional subspace component and a residual, then uses cached low-rank linear weights for the subspace while applying a per-token gate to optionally skip residual computation.
  • The gating scheme is designed to keep the output distribution close to the baseline within a controllable tolerance.
  • Experiments on GPT-2 124M, GPT-J 6B, and OPT 6.7B show 3.0x–10.5x speedups on linear-layer weight reads while maintaining perplexity ratios below 1.00 and top-1 agreement above 98% on AMD MI300X.
  • The approach requires no retraining and no architectural changes, preserves attention exactly, and reports character-for-character identical outputs at a specific operating point (k=256, ε=0.05) on GPT-J 6B.

Abstract

A method is presented for accelerating inference in transformer language models by exploiting the low effective rank of the token activation manifold at each layer. The method decomposes each activation vector into a subspace component and a residual, computes the linear-layer output on the subspace component via a cached low-rank weight image at reduced memory bandwidth, and applies a per-token gate that determines whether the residual correction is computed or skipped. The gate ensures that the output distribution is preserved to within a controllable tolerance. Validation on three model families (GPT-2 124M, GPT-J 6B, OPT 6.7B) on AMD MI300X demonstrates effective speedups of 3.0x to 10.5x on linear-layer weight reads with perplexity ratios below 1.00 and top-1 token agreement above 98%. The method requires no retraining, no architectural modification, and no approximation of the attention mechanism. At the operating point (k = 256, ε = 0.05) on GPT-J 6B, the accelerated model produces character-for-character identical output to the baseline.
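
As a rough sanity check on the reported 3.0x–10.5x weight-read speedups, the bandwidth accounting implied by the method can be worked through with illustrative numbers. The layer dimensions and gate-fire rates below are assumptions for the example, not figures from the paper:

```python
d_in = d_out = 4096              # hypothetical linear-layer dimensions
k = 256                          # subspace rank from the reported operating point
full_reads = d_in * d_out        # baseline: read the full weight matrix per token

def effective_speedup(p_gate):
    """p_gate: fraction of tokens whose residual correction is computed."""
    # Cached low-rank image is always read; full weights only when the gate fires.
    reads = d_out * k + p_gate * full_reads
    return full_reads / reads

print(effective_speedup(0.0))    # every token stays in the subspace -> 16.0
print(effective_speedup(0.25))   # one token in four takes the full path -> 3.2
```

Under this accounting, the achievable speedup is bounded by `d_in / k` and degrades toward the baseline as the gate fires more often, which is consistent with the reported range spanning 3.0x to 10.5x across models and operating points.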