Learning to Adapt: In-Context Learning Beyond Stationarity

arXiv cs.LG · April 14, 2026


Key Points

  • The paper investigates how transformer in-context learning (ICL) behaves when task relationships are non-stationary, i.e., the underlying input-output mapping changes over time.
  • It provides a theoretical analysis for non-stationary regression settings and models evolution using a first-order autoregressive process.
  • The authors argue that gated linear attention (GLA) adaptively adjusts how much past inputs influence predictions, effectively learning a recency bias.
  • They show (theoretically and empirically) that GLA can achieve lower training and testing errors than standard linear attention in these dynamic ICL tasks.
  • Experiments validate the usefulness of gating mechanisms for ICL under shifting data-generating processes, bridging an assumption gap in prior stationary-focused analyses.
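The non-stationary setup in the key points can be sketched with a toy data generator. This is an illustrative NumPy sketch, not the paper's exact configuration: the dimensions, the AR(1) coefficient `rho`, and the noise scaling are all assumptions. The task weight vector drifts between prompt tokens as a first-order autoregressive process, so earlier in-context examples become progressively less informative about the current input-output mapping.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 8    # input dimension and prompt length (assumed values)
rho = 0.9      # AR(1) coefficient: closer to 1 means slower task drift

# Task weights evolve as w_{t+1} = rho * w_t + sqrt(1 - rho^2) * noise,
# which keeps the marginal variance of w constant over time.
w = rng.normal(size=d)
xs, ys = [], []
for t in range(T):
    x = rng.normal(size=d)
    xs.append(x)
    ys.append(x @ w)  # label under the *current* task weights
    w = rho * w + np.sqrt(1 - rho**2) * rng.normal(size=d)
```

With `rho = 1` the task is stationary (the setting prior ICL analyses assume); with `rho < 1` each in-context pair is generated by a slightly different regression task, which is the regime where a recency bias should help.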

Abstract

Transformer models have become foundational across a wide range of scientific and engineering domains due to their strong empirical performance. A key capability underlying their success is in-context learning (ICL): when presented with a short prompt from an unseen task, transformers can perform per-token and next-token predictions without any parameter updates. Recent theoretical efforts have begun to uncover the mechanisms behind this phenomenon, particularly in supervised regression settings. However, these analyses predominantly assume stationary task distributions, which overlook a broad class of real-world scenarios where the target function varies over time. In this work, we bridge this gap by providing a theoretical analysis of ICL under non-stationary regression problems. We study how the gated linear attention (GLA) mechanism adapts to evolving input-output relationships and rigorously characterize its advantages over standard linear attention in this dynamic setting. To model non-stationarity, we adopt a first-order autoregressive process and show that GLA achieves lower training and testing errors by adaptively modulating the influence of past inputs -- effectively implementing a learnable recency bias. Our theoretical findings are further supported by empirical results, which validate the benefits of gating mechanisms in non-stationary ICL tasks.
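The contrast the abstract draws between gated and standard linear attention can be made concrete in recurrent form. The sketch below is a minimal assumed implementation, not the paper's: a scalar gate `g_t` multiplies the running key-value state before each update, so past pairs are exponentially down-weighted. Setting every gate to 1 recovers standard (unnormalized) linear attention.

```python
import numpy as np

def gated_linear_attention(Q, K, V, gates):
    """Recurrent gated linear attention (illustrative sketch).

    State update: S_t = g_t * S_{t-1} + v_t k_t^T
    Output:       o_t = S_t q_t

    A gate g_t < 1 decays the contribution of older key-value pairs,
    i.e. a learnable recency bias; g_t = 1 gives plain linear attention.
    """
    T, d = Q.shape
    S = np.zeros((V.shape[1], d))  # running key-value summary
    outs = []
    for t in range(T):
        S = gates[t] * S + np.outer(V[t], K[t])
        outs.append(S @ Q[t])
    return np.stack(outs)
```

In the paper's non-stationary regression setting, the gate values would be learned; intuitively, faster task drift (smaller AR coefficient) should push the learned gates further below 1, discounting stale in-context examples more aggressively.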