BASIS: Balanced Activation Sketching with Invariant Scalars for "Ghost Backpropagation"

arXiv cs.LG · April 21, 2026


Key Points

  • The paper introduces BASIS (Balanced Activation Sketching with Invariant Scalars), a new “ghost backpropagation” method that aims to reduce the activation-memory bottleneck that makes exact backpropagation scale as O(L * B * N).
  • BASIS preserves exact gradient flow for activations (dX) while computing weight updates (dW) using highly compressed rank-R (sketched) tensors, reducing backward compute and memory requirements to about O(L * R * N).
  • To address instability from sketched gradients, BASIS adds Balanced Hashing to eliminate off-diagonal collision variance and Invariant Scalars to maintain the exact continuous energy norm of the spatial geometry via a controlled bias-variance tradeoff.
  • Empirically, training a GPT-style model for 50,000 steps shows that BASIS matches or slightly improves on the validation loss of exact backpropagation (6.575 vs. 6.616) at R = 32, and that it still converges smoothly even at extreme compression (R = 1), suggesting strong robustness and an implicit regularization effect.
  • The authors release the implementation on GitHub, enabling direct experimentation with BASIS in deep and GPT-like architectures.
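The decoupling described in the key points can be illustrated for a single linear layer. The sketch below is an illustrative reconstruction, not the authors' implementation: the function names are hypothetical, and a plain random-sign sketch stands in for the paper's Balanced Hashing and Invariant Scalars.

```python
import numpy as np

def forward(X, W, R, rng):
    """Forward pass for Y = X @ W that caches only a rank-R sketch of X.

    Instead of storing the full activation X (shape B x N_in), we store
    S = P @ X (shape R x N_in), where P is a random-sign sketch matrix
    satisfying E[P.T @ P] = I, so the sketched dW below is unbiased.
    """
    B = X.shape[0]
    P = rng.choice([-1.0, 1.0], size=(R, B)) / np.sqrt(R)
    Y = X @ W
    cache = (P @ X, P, W)  # rank-R sketch replaces the full activation
    return Y, cache

def backward(dY, cache):
    """Backward pass: exact dX, sketched dW (the 'ghost' weight gradient)."""
    S, P, W = cache
    dX = dY @ W.T            # exact error signal: gradient flow is preserved
    dW = S.T @ (P @ dY)      # = X.T @ P.T @ P @ dY, unbiased estimate of X.T @ dY
    return dX, dW
```

When R equals the batch dimension and P is orthogonal (e.g. the identity), the sketched dW collapses to the exact gradient X.T @ dY; for R much smaller than B the estimator is unbiased but noisy, which is the instability that Balanced Hashing and Invariant Scalars are designed to control.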

Abstract

The activation memory required for exact backpropagation scales linearly with network depth, context length, and feature dimensionality, forming an O(L * B * N) spatial bottleneck (where L is the network depth, B is the sequence-batch cardinality, and N is the feature dimension). This constraint has historically throttled the scaling of deep neural networks. Randomized automatic differentiation attempts to mitigate it, but has suffered from catastrophic variance. In this paper, we introduce BASIS (Balanced Activation Sketching with Invariant Scalars), an efficient backpropagation algorithm that fully decouples activation memory from the batch and sequence dimensions. BASIS propagates the exact error signal (dX) to preserve flawless gradient flow, but computes the weight updates (dW) using massively compressed rank-R tensors. To solve the foundational instability of sketched gradients, we propose two novel mechanisms: Balanced Hashing, which strictly eliminates off-diagonal collision variance, and Invariant Scalars, a principled bias-variance tradeoff that deterministically preserves the exact continuous energy norm of the spatial geometry. Theoretically, BASIS reduces activation memory to O(L * R * N) and heavily decreases the backward-pass matrix-multiplication footprint. Empirically, training a GPT architecture for 50,000 steps validates our theoretical guarantees: at R = 32, BASIS achieves parity with (and marginally outperforms) the validation loss of exact backpropagation (6.575 vs. 6.616), acting as an implicit regularizer. Remarkably, the stabilized magnitude trajectory allows the model to converge smoothly even under extreme spatial compression (R = 1), demonstrating the robustness of the estimator. The code is available at https://github.com/VladimerKhasia/basis
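The claimed O(L * B * N) to O(L * R * N) reduction is easy to quantify. The configuration below is illustrative and not taken from the paper: a hypothetical GPT-style model with L = 48 layers, B = 4096 sequence-batch tokens, N = 4096 features, fp16 activations, and sketch rank R = 32.

```python
def activation_memory_bytes(L, rows, N, bytes_per_elem=2):
    """Memory for one cached activation tensor per layer: L * rows * N elements."""
    return L * rows * N * bytes_per_elem

# Hypothetical GPT-style configuration (illustrative, not from the paper).
L, B, N, R = 48, 4096, 4096, 32

exact = activation_memory_bytes(L, B, N)  # O(L * B * N): full activations cached
basis = activation_memory_bytes(L, R, N)  # O(L * R * N): rank-R sketches cached

print(f"exact backprop: {exact / 2**30:.1f} GiB")
print(f"BASIS (R=32):   {basis / 2**30:.3f} GiB")
print(f"reduction:      {exact // basis}x")  # B / R = 128x
```

Because the cached tensors no longer carry the B dimension, the saving is exactly the ratio B / R, independent of depth and feature width.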