EMA Is Not All You Need: Mapping the Boundary Between Structure and Content in Recurrent Context

arXiv cs.AI / April 13, 2026


Key Points

  • The paper uses exponential moving average (EMA) traces as a controlled probe to determine what fixed-coefficient recurrent context can represent versus what it fundamentally cannot.
  • EMA traces are shown to encode temporal structure effectively: a Hebbian multi-timescale approach reaches 96% of a supervised BiGRU's accuracy on grammatical role assignment without any labels, and even outperforms it on structure-dependent roles.
  • The study finds that EMA traces destroy token identity: a 130M-parameter language model relying only on EMA context reaches C4 perplexity 260 (about 8× that of GPT-2), indicating major limits on content retention.
  • A predictor ablation (replacing a linear predictor with full softmax attention) yields identical loss, localizing the performance gap specifically to information discarded by the traces.
  • The authors argue that EMA traces perform lossy, data-independent compression; by the data processing inequality, no downstream predictor can recover discarded information, implying that only learned, input-dependent selection can overcome fixed accumulation’s irreversible dilution.
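To make the setup concrete, here is a minimal sketch of fixed-coefficient EMA traces as the key points describe them: each trace is updated by the same rule at every step, with no gating and no content-based retrieval. The rates and dimensions below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def ema_traces(xs, alphas):
    """Run fixed-coefficient EMA traces over a sequence of vectors.

    Each trace follows c_t = (1 - a) * c_{t-1} + a * x_t with its own rate a.
    The coefficients are fixed in advance, so the update is identical for
    every input: no gating, no content-based retrieval.
    """
    d = xs.shape[1]
    traces = np.zeros((len(alphas), d))
    history = []
    for x in xs:
        for i, a in enumerate(alphas):
            traces[i] = (1.0 - a) * traces[i] + a * x
        history.append(traces.copy())
    return np.stack(history)  # shape (T, n_traces, d)

# Hypothetical rates spanning fast to slow timescales.
rng = np.random.default_rng(0)
xs = rng.normal(size=(50, 8))                  # 50 token embeddings, dim 8
hist = ema_traces(xs, alphas=[0.5, 0.1, 0.02])
print(hist.shape)  # (50, 3, 8)
```

Stacking several rates is what gives the multi-timescale context: fast traces track recency while slow traces accumulate a long-horizon average.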

Abstract

What exactly do efficient sequence models gain over simple temporal averaging? We use exponential moving average (EMA) traces, the simplest recurrent context (no gating, no content-based retrieval), as a controlled probe to map the boundary between what fixed-coefficient accumulation can and cannot represent. EMA traces encode temporal structure: a Hebbian architecture with multi-timescale traces achieves 96% of a supervised BiGRU on grammatical role assignment with zero labels, surpassing the supervised model on structure-dependent roles. EMA traces destroy token identity: a 130M-parameter language model using only EMA context reaches C4 perplexity 260 (8× GPT-2), and a predictor ablation (replacing the linear predictor with full softmax attention) yields identical loss, localizing the entire gap to the traces. The traces apply lossy, data-independent compression; by the data processing inequality, no downstream predictor can recover the discarded information. Fixed-coefficient accumulation, whether across time or depth, suffers irreversible information dilution that only learned, input-dependent selection can resolve.
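The data-processing-inequality argument can be illustrated directly: because an EMA trace is a fixed linear functional of its inputs, any two sequences whose weighted sums agree collapse to the same trace, and no downstream predictor, linear or attention-based, can tell them apart afterward. The sketch below constructs such a collision; the token vectors and rate are illustrative assumptions.

```python
import numpy as np

ALPHA = 0.5  # fixed EMA coefficient, chosen for easy arithmetic

def run(xs):
    """Fold a sequence of vectors into a single fixed-coefficient EMA trace."""
    c = np.zeros_like(xs[0])
    for x in xs:
        c = (1.0 - ALPHA) * c + ALPHA * x
    return c

v = np.array([1.0, -2.0, 3.0])
seq_a = [2.0 * v, 0.0 * v]   # token 2v, then the zero token
seq_b = [0.0 * v, 1.0 * v]   # zero token, then v

# Both sequences reduce to the trace 0.5 * v: distinct inputs, identical state.
print(np.allclose(run(seq_a), run(seq_b)))  # True
```

Learned, input-dependent selection (gating, attention) avoids this collapse precisely because its coefficients depend on the tokens themselves, so the mapping is no longer a single fixed linear functional.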