Out-of-Distribution Generalization of In-Context Learning: A Low-Dimensional Subspace Perspective

arXiv stat.ML / 4/30/2026


Key Points

  • The paper addresses a key gap in understanding when in-context learning (ICL) can or cannot generalize beyond the pre-training data distribution.
  • It introduces a minimal, provable mathematical model using linear regression tasks with low-rank covariance matrices, treating distribution shift as a changing angle between subspaces (a minimal data-model sketch follows this list).
  • The authors derive conditions under which a single-layer linear attention model can interpolate across all subspace angles, enabling ICL generalization even to test regions with zero training probability mass.
  • They show a contrasting result: when pre-training tasks come from a single Gaussian, test risk depends on the angle, indicating ICL fails to generalize out-of-distribution (OOD) in that setting.
  • Experiments suggest the insights also apply to architectures such as GPT-2 and extend to nonlinear function classes.
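
To make the setup concrete, here is a minimal NumPy sketch (not the authors' code) of the data model described above: task vectors for each in-context regression task lie in a low-dimensional subspace, so the task distribution has rank-r covariance, and the distribution shift is realized as an angle between the pre-training and test subspaces. The dimensions, noise level, and the `rotate_subspace` helper are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_ctx = 16, 2, 32  # ambient dimension, subspace rank, context length (illustrative)

def orthonormal_basis(d, r, rng):
    """Random r-dimensional orthonormal basis in R^d (columns of the returned matrix)."""
    q, _ = np.linalg.qr(rng.standard_normal((d, r)))
    return q

def rotate_subspace(U, angle, rng):
    """Rotate the first basis vector of U by `angle` toward a direction orthogonal
    to span(U): a simple stand-in for the angle shift between the pre-training
    and test subspaces."""
    v = rng.standard_normal(U.shape[0])
    v -= U @ (U.T @ v)              # project out span(U)
    v /= np.linalg.norm(v)
    U_shift = U.copy()
    U_shift[:, 0] = np.cos(angle) * U[:, 0] + np.sin(angle) * v
    return U_shift

def sample_icl_task(U, n_ctx, noise=0.1, rng=rng):
    """One in-context linear regression task: the task vector beta lies in span(U),
    i.e. the task distribution has a rank-r covariance."""
    beta = U @ rng.standard_normal(U.shape[1])
    X = rng.standard_normal((n_ctx + 1, U.shape[0]))   # n_ctx context inputs + 1 query
    y = X @ beta + noise * rng.standard_normal(n_ctx + 1)
    return X, y, beta

U_train = orthonormal_basis(d, r, rng)                       # pre-training subspace
U_test = rotate_subspace(U_train, angle=np.pi / 4, rng=rng)  # shifted (OOD) test subspace
X, y, beta = sample_icl_task(U_test, n_ctx)
```

Pre-training on a union of such subspaces versus a single full-rank Gaussian over task vectors is then simply a matter of which distribution `beta` is drawn from; the paper's contrast is between those two choices.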

Abstract

The transformer's remarkable ability to perform in-context learning (ICL) has sparked a wide range of studies designed to understand its strengths and limitations. However, a theoretical understanding of when ICL can and cannot generalize beyond its pre-training data is still lacking. This paper puts forth a minimal mathematical model that provably identifies when ICL can generalize out-of-distribution (OOD). By studying linear regression tasks parameterized with low-rank covariance matrices, we model distribution shifts as varying angles between subspaces and derive conditions under which a single-layer linear attention model interpolates across all angles. We show that if pre-training task vectors are drawn from a union of subspaces, transformers can generalize to all angle shifts, enabling ICL even in regions with zero probability mass in the training distribution. On the other hand, if the pre-training tasks are drawn from a single Gaussian, the test risk shows a non-negligible dependence on the angle, implying that ICL cannot generalize OOD. We empirically show that our results also hold for models such as GPT-2, and present experiments on how our results extend to nonlinear function classes.
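
The single-layer linear attention model in the abstract can be illustrated with the simplified prediction rule that is common in the linear-attention ICL theory literature, ŷ = x_qᵀ W (1/n) Σᵢ yᵢ xᵢ, where W is a trainable d×d matrix. The sketch below is an assumption-laden reduction, not the paper's exact parameterization; the `test_risk` Monte-Carlo routine and the placeholder choice W = I are illustrative only.

```python
import numpy as np

def linear_attention_predict(W, X_ctx, y_ctx, x_query):
    """Simplified single-layer linear attention prediction:
        y_hat = x_query^T W (1/n) sum_i y_i x_i,
    with W a trainable d x d matrix standing in for the combined
    query/key/value weights."""
    moment = X_ctx.T @ y_ctx / len(y_ctx)   # (1/n) sum_i y_i x_i
    return x_query @ W @ moment

def test_risk(W, U_test, n_tasks=2000, n_ctx=32, noise=0.1, seed=0):
    """Monte-Carlo estimate of the squared prediction error on linear regression
    tasks whose task vectors lie in span(U_test)."""
    rng = np.random.default_rng(seed)
    d, r = U_test.shape
    errs = []
    for _ in range(n_tasks):
        beta = U_test @ rng.standard_normal(r)
        X = rng.standard_normal((n_ctx, d))
        y = X @ beta + noise * rng.standard_normal(n_ctx)
        x_q = rng.standard_normal(d)
        errs.append((linear_attention_predict(W, X, y, x_q) - x_q @ beta) ** 2)
    return float(np.mean(errs))

# Quick self-contained check with an identity W and a random rank-2 subspace.
rng = np.random.default_rng(1)
d, r = 16, 2
U, _ = np.linalg.qr(rng.standard_normal((d, r)))
print(test_risk(np.eye(d), U))

# With the subspace construction from the previous sketch, sweeping the angle
# traces how the risk of a fixed (pre-trained) W depends on the shift:
#   for angle in np.linspace(0.0, np.pi / 2, 5):
#       print(angle, test_risk(W, rotate_subspace(U_train, angle, rng)))
```

The paper's dichotomy can be read through this lens: whether the risk curve over the angle sweep stays flat (union-of-subspaces pre-training) or degrades with the angle (single-Gaussian pre-training) is exactly the OOD question the theory resolves.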