Gated Subspace Inference for Transformer Acceleration

arXiv cs.LG / 5/6/2026


Key Points

  • The paper proposes a transformer inference acceleration method that leverages the low effective rank of token activation manifolds at each layer.
  • It splits each token activation into a low-dimensional subspace component and a residual, then uses cached low-rank linear weights for the subspace while applying a per-token gate to optionally skip residual computation.
  • The gating scheme is designed to keep the output distribution close to the baseline within a controllable tolerance.
  • Experiments on GPT-2 124M, GPT-J 6B, and OPT 6.7B show 3.0x–10.5x speedups on linear-layer weight reads while maintaining perplexity ratios below 1.00 and top-1 agreement above 98% on AMD MI300X.
  • The approach requires no retraining and no architectural changes, preserves attention exactly, and reports character-for-character identical outputs at a specific operating point (k=256, ε=0.05) on GPT-J 6B.

Abstract

A method is presented for accelerating inference in transformer language models by exploiting the low effective rank of the token activation manifold at each layer. The method decomposes each activation vector into a subspace component and a residual, computes the linear-layer output on the subspace component via a cached low-rank weight image at reduced memory bandwidth, and applies a per-token gate that determines whether the residual correction is computed or skipped. The gate ensures that the output distribution is preserved to within a controllable tolerance. Validation on three model families (GPT-2 124M, GPT-J 6B, OPT 6.7B) on AMD MI300X demonstrates effective speedups of 3.0x to 10.5x on linear-layer weight reads with perplexity ratios below 1.00 and top-1 token agreement above 98%. The method requires no retraining, no architectural modification, and no approximation of the attention mechanism. At the operating point (k = 256, ε = 0.05) on GPT-J 6B, the accelerated model produces character-for-character identical output to the baseline.
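
As a rough sanity check on the reported 3.0x–10.5x weight-read speedups, the bandwidth accounting implied by the method can be worked through with illustrative numbers. The layer dimensions and gate-fire rates below are assumptions for the example, not figures from the paper:

```python
d_in = d_out = 4096              # hypothetical linear-layer dimensions
k = 256                          # subspace rank from the reported operating point
full_reads = d_in * d_out        # baseline: read the full weight matrix per token

def effective_speedup(p_gate):
    """p_gate: fraction of tokens whose residual correction is computed."""
    # Cached low-rank image is always read; full weights only when the gate fires.
    reads = d_out * k + p_gate * full_reads
    return full_reads / reads

print(effective_speedup(0.0))    # every token stays in the subspace -> 16.0
print(effective_speedup(0.25))   # one token in four takes the full path -> 3.2
```

Under this accounting, the achievable speedup is bounded by `d_in / k` and degrades toward the baseline as the gate fires more often, which is consistent with the reported range spanning 3.0x to 10.5x across models and operating points.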