Marginals Before Conditionals

arXiv cs.AI / 3/12/2026

Key Points

  • The paper constructs a minimal task that isolates conditional learning in neural networks using a surjective map with K-fold ambiguity resolved by a selector token z, yielding H(A|B) = log K and H(A|B, z) = 0.
  • The model first learns the marginal distribution P(A|B), producing a plateau at height log K whose duration depends on dataset size D rather than K.
  • Gradient noise stabilizes the marginal solution: higher learning rates slow the transition, and smaller batch sizes delay the escape, consistent with an entropic force opposing departure from the low-gradient marginal.
  • A selector-routing head assembles during the plateau and leads the loss transition by about 50% of the waiting time, illustrating Type 2 directional asymmetry.
  • The study tracks the excess risk as it falls from log K to zero, characterizing what stabilizes it, what triggers its collapse, and how long it takes.
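
The plateau height in the points above can be motivated with a short worked bound. This is a sketch rather than the paper's own derivation: it assumes the training loss is the cross-entropy in nats, and it uses q as a hypothetical name for the model's predictive distribution. A predictor that ignores z cannot go below H(A | B), while one that uses z can reach zero, so the excess risk on the plateau is exactly log K.

```latex
\begin{align}
\mathcal{L}(q)
  &= \mathbb{E}_{(a,b)}\!\left[-\log q(a \mid b)\right] \\
  &= H(A \mid B)
     + \mathbb{E}_{b}\,\mathrm{KL}\!\left(P(\cdot \mid b)\,\middle\|\,q(\cdot \mid b)\right) \\
  &\ge H(A \mid B) = \log K, \\
\min_{q}\ \mathbb{E}_{(a,b,z)}\!\left[-\log q(a \mid b, z)\right]
  &= H(A \mid B, z) = 0 .
\end{align}
```

Equality in the third line holds when q(a | b) = P(a | b), which is the marginal solution the model reaches first.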

Abstract

We construct a minimal task that isolates conditional learning in neural networks: a surjective map with K-fold ambiguity, resolved by a selector token z, so H(A | B) = log K while H(A | B, z) = 0. The model learns the marginal P(A | B) first, producing a plateau at exactly log K, before acquiring the full conditional in a sharp, collective transition. The plateau has a clean decomposition: height = log K (set by ambiguity), duration = f(D) (set by dataset size D, not K). Gradient noise stabilizes the marginal solution: higher learning rates monotonically slow the transition (3.6× across a 7× η range at fixed throughput), and batch-size reduction delays escape, consistent with an entropic force opposing departure from the low-gradient marginal. Internally, a selector-routing head assembles during the plateau, leading the loss transition by ~50% of the waiting time. This is the Type 2 directional asymmetry of Papadopoulos et al. [2024], measured dynamically: we track the excess risk from log K to zero and characterize what stabilizes it, what triggers its collapse, and how long it takes.
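
To make the task construction concrete, here is a minimal Python sketch of one way such a dataset could be laid out. It is an illustration under assumptions, not the paper's code: the integer encoding, the function names `build_task` and `conditional_entropy`, and the choice of 50 B-values with K = 8 are all hypothetical. It builds a K-to-1 surjective map, uses the index of each preimage as the selector z, and checks the two entropies stated in the abstract.

```python
import math
from collections import Counter, defaultdict


def build_task(num_b: int, k: int):
    """Return (a, b, z) triples for a K-to-1 map (hypothetical encoding)."""
    triples = []
    for a in range(num_b * k):
        b = a // k  # surjective map: every b has exactly k preimages
        z = a % k   # selector: which of the k preimages produced b
        triples.append((a, b, z))
    return triples


def conditional_entropy(triples, context):
    """H(A | context(b, z)) in nats, with triples weighted uniformly."""
    groups = defaultdict(Counter)
    for a, b, z in triples:
        groups[context(b, z)][a] += 1
    n = len(triples)
    h = 0.0
    for counts in groups.values():
        total = sum(counts.values())
        for c in counts.values():
            p = c / total
            # Weight each group's entropy by the probability of its context.
            h -= (total / n) * p * math.log(p)
    return h


triples = build_task(num_b=50, k=8)
h_marginal = conditional_entropy(triples, lambda b, z: b)   # z ignored
h_full = conditional_entropy(triples, lambda b, z: (b, z))  # z used

print(f"H(A|B)   = {h_marginal:.4f}  (log K = {math.log(8):.4f})")
print(f"H(A|B,z) = {h_full:.4f}")
```

In this layout the pair (b, z) identifies a uniquely, so the second entropy is zero, while dropping z leaves K equally likely candidates per b and gives log K.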