Out-of-Distribution Generalization of In-Context Learning: A Low-Dimensional Subspace Perspective

arXiv stat.ML / 4/30/2026


Key Points

  • The paper addresses a key gap in understanding when in-context learning (ICL) can or cannot generalize beyond the pre-training data distribution.
  • It introduces a minimal, provable mathematical model using linear regression tasks with low-rank covariance matrices, treating distribution shift as a changing angle between subspaces (a minimal data-model sketch follows this list).
  • The authors derive conditions under which a single-layer linear attention model can interpolate across all subspace angles, enabling ICL generalization even to test regions with zero training probability mass.
  • They show a contrasting result: when pre-training tasks come from a single Gaussian, test risk depends on the angle, indicating ICL fails to generalize out-of-distribution (OOD) in that setting.
  • Experiments suggest the insights also apply to architectures such as GPT-2 and extend to nonlinear function classes.
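
To make the setup concrete, here is a minimal NumPy sketch (not the authors' code) of the data model described above: task vectors for each in-context regression task lie in a low-dimensional subspace, so the task distribution has rank-r covariance, and the distribution shift is realized as an angle between the pre-training and test subspaces. The dimensions, noise level, and the `rotate_subspace` helper are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_ctx = 16, 2, 32  # ambient dimension, subspace rank, context length (illustrative)

def orthonormal_basis(d, r, rng):
    """Random r-dimensional orthonormal basis in R^d (columns of the returned matrix)."""
    q, _ = np.linalg.qr(rng.standard_normal((d, r)))
    return q

def rotate_subspace(U, angle, rng):
    """Rotate the first basis vector of U by `angle` toward a direction orthogonal
    to span(U): a simple stand-in for the angle shift between the pre-training
    and test subspaces."""
    v = rng.standard_normal(U.shape[0])
    v -= U @ (U.T @ v)              # project out span(U)
    v /= np.linalg.norm(v)
    U_shift = U.copy()
    U_shift[:, 0] = np.cos(angle) * U[:, 0] + np.sin(angle) * v
    return U_shift

def sample_icl_task(U, n_ctx, noise=0.1, rng=rng):
    """One in-context linear regression task: the task vector beta lies in span(U),
    i.e. the task distribution has a rank-r covariance."""
    beta = U @ rng.standard_normal(U.shape[1])
    X = rng.standard_normal((n_ctx + 1, U.shape[0]))   # n_ctx context inputs + 1 query
    y = X @ beta + noise * rng.standard_normal(n_ctx + 1)
    return X, y, beta

U_train = orthonormal_basis(d, r, rng)                       # pre-training subspace
U_test = rotate_subspace(U_train, angle=np.pi / 4, rng=rng)  # shifted (OOD) test subspace
X, y, beta = sample_icl_task(U_test, n_ctx)
```

Pre-training on a union of such subspaces versus a single full-rank Gaussian over task vectors is then simply a matter of which distribution `beta` is drawn from; the paper's contrast is between those two choices.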

Abstract

The transformer's remarkable ability to perform in-context learning (ICL) has sparked a wide range of studies designed to understand its strengths and limitations. However, a theoretical understanding of when ICL can and cannot generalize beyond its pre-training data is still lacking. This paper puts forth a minimal mathematical model that provably identifies when ICL can generalize out-of-distribution (OOD). By studying linear regression tasks parameterized with low-rank covariance matrices, we model distribution shifts as varying angles between subspaces and derive conditions under which a single-layer linear attention model interpolates across all angles. We show that if pre-training task vectors are drawn from a union of subspaces, transformers can generalize to all angle shifts, enabling ICL even in regions with zero probability mass in the training distribution. On the other hand, if the pre-training tasks are drawn from a single Gaussian, the test risk shows a non-negligible dependence on the angle, implying that ICL cannot generalize OOD. We empirically show that our results also hold for models such as GPT-2, and present experiments on how our results extend to nonlinear function classes.
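
The single-layer linear attention model in the abstract can be illustrated with the simplified prediction rule that is common in the linear-attention ICL theory literature, ŷ = x_qᵀ W (1/n) Σᵢ yᵢ xᵢ, where W is a trainable d×d matrix. The sketch below is an assumption-laden reduction, not the paper's exact parameterization; the `test_risk` Monte-Carlo routine and the placeholder choice W = I are illustrative only.

```python
import numpy as np

def linear_attention_predict(W, X_ctx, y_ctx, x_query):
    """Simplified single-layer linear attention prediction:
        y_hat = x_query^T W (1/n) sum_i y_i x_i,
    with W a trainable d x d matrix standing in for the combined
    query/key/value weights."""
    moment = X_ctx.T @ y_ctx / len(y_ctx)   # (1/n) sum_i y_i x_i
    return x_query @ W @ moment

def test_risk(W, U_test, n_tasks=2000, n_ctx=32, noise=0.1, seed=0):
    """Monte-Carlo estimate of the squared prediction error on linear regression
    tasks whose task vectors lie in span(U_test)."""
    rng = np.random.default_rng(seed)
    d, r = U_test.shape
    errs = []
    for _ in range(n_tasks):
        beta = U_test @ rng.standard_normal(r)
        X = rng.standard_normal((n_ctx, d))
        y = X @ beta + noise * rng.standard_normal(n_ctx)
        x_q = rng.standard_normal(d)
        errs.append((linear_attention_predict(W, X, y, x_q) - x_q @ beta) ** 2)
    return float(np.mean(errs))

# Quick self-contained check with an identity W and a random rank-2 subspace.
rng = np.random.default_rng(1)
d, r = 16, 2
U, _ = np.linalg.qr(rng.standard_normal((d, r)))
print(test_risk(np.eye(d), U))

# With the subspace construction from the previous sketch, sweeping the angle
# traces how the risk of a fixed (pre-trained) W depends on the shift:
#   for angle in np.linspace(0.0, np.pi / 2, 5):
#       print(angle, test_risk(W, rotate_subspace(U_train, angle, rng)))
```

The paper's dichotomy can be read through this lens: whether the risk curve over the angle sweep stays flat (union-of-subspaces pre-training) or degrades with the angle (single-Gaussian pre-training) is exactly the OOD question the theory resolves.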