Abstract
Modern neural networks, notably transformers, acquire a remarkable ability (termed `in-context learning') to adapt their computation to input statistics, such that a fixed network can be applied to data from a broad range of systems. Here, we provide a complete mechanistic characterization of this behavior in transformers trained on a finite set $S$ of discrete Markov chains. The transformer displays four algorithmic phases, characterized by whether the network memorizes or generalizes, and whether it uses 1-point or 2-point statistics. We show that the four phases are implemented by multi-layer subcircuits that exemplify two qualitatively distinct mechanisms for context-adaptive computation. Minimal models isolate the key features of both motifs. Memorization and generalization phases are delineated by two boundaries that depend on the data diversity $K = |S|$. The first ($K_1^\ast$) is set by a kinetic competition between subcircuits, and the second ($K_2^\ast$) by a representational bottleneck. A symmetry-constrained theory of the transformer's training dynamics explains the sharp transition from 1-point to 2-point generalization and identifies key features of the loss landscape that allow the network to generalize. Taken together, we show that transformers develop distinct subcircuits to implement in-context learning, and we identify the conditions that favor certain mechanisms over others.