Hybrid Associative Memories

arXiv cs.AI / 3/25/2026


Key Points

  • RNNs and self-attention use fundamentally different memory mechanisms: RNNs compress history into a fixed-size state, while self-attention stores past time steps via a KV cache that grows with sequence length.
  • The paper argues that naive hybridization (e.g., simple interleaving) misses these complementary strengths and weaknesses.
  • It proposes a Hybrid Associative Memory (HAM) layer that uses an RNN to summarize the full sequence while letting attention add only the “hard-to-predict” information, yielding data-dependent KV cache growth.
  • HAM introduces a user-controllable, continuous threshold to precisely regulate KV-cache expansion, enabling a smooth loss/performance trade-off.
  • Experiments indicate HAM can match or outperform competitive RNN/Transformer performance while using substantially less KV-cache than standard attention approaches.

Abstract

Recurrent neural networks (RNNs) and self-attention are both widely used sequence-mixing layers that maintain an internal memory. However, this memory is constructed using two orthogonal mechanisms: RNNs compress the entire past into a fixed-size state, whereas self-attention stores every past time step, growing its state (the KV cache) linearly with the sequence length. This results in orthogonal strengths and weaknesses. Self-attention layers excel at retrieving information from the context but have large memory and computational costs, while RNNs are more efficient but degrade over longer contexts and underperform on precise recall tasks. Prior work combining these mechanisms has focused primarily on naively interleaving them to reduce computational cost, without regard to their complementary mechanisms. We propose the Hybrid Associative Memory (HAM) layer, which combines self-attention and RNNs while leveraging their individual strengths: the RNN compresses the entire sequence, while attention supplements it *only* with information that is difficult for the RNN to predict, which is hence the most valuable information to explicitly store. HAM layers enable data-dependent growth of the KV cache, which can be precisely controlled by the user with a single, continuous threshold. We find that this fine-grained control of the KV-cache growth rate has a smooth trade-off with loss and performance. Empirically, we show that our hybrid architecture offers strong, competitive performance relative to RNNs and Transformers even at substantially lower KV-cache usage.
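The core idea above — store a token in the KV cache only when the RNN finds it hard to predict — can be illustrated with a toy sketch. Everything here is an assumption for illustration: the linear RNN, the prediction head, the surprise measure (prediction error norm), and the way RNN state and attention output are combined are all hypothetical stand-ins, not the paper's actual parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

def ham_layer(x, threshold, d=8):
    """Toy HAM-style layer: data-dependent KV-cache growth.

    A simple linear RNN compresses the history into a fixed-size state.
    A token is appended to the KV cache only when the RNN's prediction
    of it is poor (high "surprise"), so cache growth is data-dependent
    and controlled by `threshold`.
    """
    T, _ = x.shape
    W_h = rng.standard_normal((d, d)) / np.sqrt(d)   # recurrent weights (illustrative)
    W_x = rng.standard_normal((d, d)) / np.sqrt(d)   # input weights (illustrative)
    W_p = rng.standard_normal((d, d)) / np.sqrt(d)   # prediction head (illustrative)
    h = np.zeros(d)
    kv_cache = []          # only "surprising" tokens are stored here
    outputs = []
    for t in range(T):
        pred = W_p @ h                          # RNN's guess of the next token
        surprise = np.linalg.norm(x[t] - pred)  # hard-to-predict => large
        if surprise > threshold:                # store only surprising tokens
            kv_cache.append(x[t])
        h = np.tanh(W_h @ h + W_x @ x[t])       # update compressed state
        if kv_cache:                            # attend over the sparse cache
            K = np.stack(kv_cache)
            scores = K @ h / np.sqrt(d)
            attn = np.exp(scores - scores.max())
            attn /= attn.sum()
            context = attn @ K
        else:
            context = np.zeros(d)
        outputs.append(h + context)             # RNN state + attention supplement
    return np.stack(outputs), len(kv_cache)

x = rng.standard_normal((16, 8))
out, cached = ham_layer(x, threshold=3.0)
print(out.shape, cached)  # cache holds at most 16 entries; fewer as threshold grows
```

Raising the single continuous threshold monotonically suppresses cache growth (at threshold 0 every token is cached, like standard attention; at a very large threshold none are, like a pure RNN), which is the smooth trade-off the paper describes.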