Absorber LLM: Harnessing Causal Synchronization for Test-Time Training

arXiv cs.LG / 4/24/2026


Key Points

  • The paper introduces “Absorber LLM,” targeting the high compute/memory cost of transformer self-attention in long-sequence and streaming inference.
  • It argues that fixed-state alternatives (e.g., RNNs/SSMs) can lose long-tail dependencies, while Test-Time Training (TTT) risks overfitting and fails to preserve causal effects from the pretrained LLM’s context.
  • Absorber LLM reframes long-context retention as self-supervised causal synchronization, training a contextless model whose future generations should match the original model’s outputs.
  • The method synchronizes internal behaviors between the updated and original models to improve both context absorption and generalization.
  • Experiments on long-context and streaming benchmarks show lower inference memory usage and better accuracy than prior “parameter-as-memory” approaches.
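The synchronization idea in the bullets above can be sketched as a distillation-style loss: match the next-token distribution (and, optionally, internal hidden states) of the contextless, updated model to those of the frozen full-context model. Below is a minimal NumPy sketch under that reading; the names (`sync_loss`, `alpha`, the hidden-state term) are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sync_loss(student_logits, teacher_logits,
              student_hidden, teacher_hidden, alpha=1.0):
    """Causal-synchronization surrogate (assumed form):
    KL(teacher || student) on next-token distributions, plus an
    internal-behavior matching term (here, MSE on hidden activations).

    teacher_* : frozen original model conditioned on the full context
    student_* : updated model running without the context
    """
    p = softmax(teacher_logits)  # target distribution (full context)
    q = softmax(student_logits)  # contextless distribution to align
    kl = np.sum(p * (np.log(p + 1e-9) - np.log(q + 1e-9)), axis=-1).mean()
    mse = np.mean((student_hidden - teacher_hidden) ** 2)
    return kl + alpha * mse
```

In the paper's setting such a loss would presumably be minimized by test-time gradient steps on the updated model's parameters after absorbing the context; here the logits and hidden states are stand-in arrays.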

Abstract

Transformers suffer from a self-attention cost that grows with sequence length, making inference over long streams prohibitively memory-intensive. Constant-memory alternatives such as RNNs and SSMs compress history into fixed-size states and thus lose long-tail dependencies, while methods that memorize contexts into parameters, such as Test-Time Training (TTT), are prone to overfitting to token-level projections and fail to preserve the causal effect of context in pretrained LLMs. We propose Absorber LLM, which formulates long-context retention as self-supervised causal synchronization: after absorbing historical context into its parameters, a contextless model should match the original model with full context on future generations. We optimize this objective by synchronizing the internal behaviors of the updated model with those of the original, promoting both context absorption and generalization. Experiments on long-context and streaming benchmarks show that Absorber LLM reduces inference memory and improves accuracy over prior parameter-as-memory baselines.
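Stated formally, the objective described in the abstract could be written as follows; the notation here is an assumption, since the summary does not reproduce the paper's symbols (θ is the frozen pretrained model, θ′ the updated context-absorbing model, c the historical context, and D a divergence such as KL):

```latex
% Causal synchronization: the contextless updated model should match
% the full-context original model on future token distributions.
\min_{\theta'} \;
\mathbb{E}_{t}\!\left[
  D\!\big(\, p_{\theta}(\cdot \mid c,\, x_{<t}) \;\big\|\;
           p_{\theta'}(\cdot \mid x_{<t}) \,\big)
\right]
```

Under this reading, "synchronizing internal behaviors" would add auxiliary terms aligning intermediate activations of the two models, not just their output distributions.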