Marginals Before Conditionals
arXiv cs.AI / 3/12/2026
Key Points
- The paper constructs a minimal task that isolates conditional learning in neural networks: a surjective input-to-answer map with K-fold ambiguity that a selector token z resolves, so that H(A|B) = log K while H(A|B, z) = 0 (see the entropy bookkeeping and dataset sketch after this list).
- The model first learns the marginal distribution P(A|B), i.e. the conditional averaged over the selector z, producing a loss plateau at height log K whose duration scales with dataset size D rather than with K.
- Gradient noise stabilizes the marginal solution: higher learning rates slow the transition and smaller batch sizes delay the escape, consistent with an entropic force opposing departure from the low-gradient marginal solution (see the training sketch after this list).
- A selector-routing head assembles during the plateau and leads the loss transition by roughly 50% of the waiting time, illustrating Type 2 directional asymmetry: the mechanism is in place before the loss moves.
- The study tracks the excess risk as it falls from log K to zero, asking what stabilizes the plateau, what triggers its collapse, and how long the collapse takes.
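
To make the plateau height concrete, here is the entropy bookkeeping the bullets imply (a sketch from the stated quantities, assuming the K selector values are uniform; not reproduced from the paper): a predictor that ignores z can at best output the marginal over the K candidates, pinning its cross-entropy at log K, while conditioning on z removes the ambiguity entirely.

```latex
% Best z-blind predictor: average the conditional over the K selector values.
\[
P(A \mid B) \;=\; \frac{1}{K} \sum_{z=1}^{K} P(A \mid B, z)
\qquad\Longrightarrow\qquad
\mathcal{L}_{\text{marginal}} \;=\; -\,\mathbb{E}\bigl[\log P(A \mid B)\bigr]
\;=\; \log K \;=\; H(A \mid B).
\]
% Since H(A | B, z) = 0, the excess risk is just the loss itself,
% falling from log K (marginal solution) to 0 (conditional solution).
```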
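A minimal sketch of such a task in code (an illustrative construction, not the paper's exact setup; `K`, `NUM_B`, and the disjoint candidate table are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4        # ambiguity factor: each input B has K valid answers
NUM_B = 16   # number of distinct inputs B

# Each input b gets K disjoint candidate answers; the selector token z
# picks exactly one. Given B alone, A is uniform over K candidates
# (H(A|B) = log K); given (B, z) it is deterministic (H(A|B,z) = 0).
candidates = rng.permutation(NUM_B * K).reshape(NUM_B, K)

def sample(n):
    """Draw n training triples (b, z, a) with a = candidates[b, z]."""
    b = rng.integers(NUM_B, size=n)
    z = rng.integers(K, size=n)
    return b, z, candidates[b, z]

b, z, a = sample(8)
print("plateau height log K =", np.log(K))
```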
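And a toy training loop that logs the two diagnostics described above, the escape step from the plateau and a routing-probe strength (a sketch under loud assumptions: a linear additive-logit model with embedding tables for b and z, hand-coded softmax cross-entropy gradients, and an arbitrary escape threshold of half of log K; this toy may not reproduce the paper's plateau timescales, it only shows how such measurements can be instrumented):

```python
import numpy as np

rng = np.random.default_rng(1)
K, NUM_B = 4, 16
NUM_A = NUM_B * K
candidates = rng.permutation(NUM_A).reshape(NUM_B, K)

def train(batch_size, lr=0.5, steps=5000):
    # Additive logits: score(a) = U[b, a] + V[z, a].
    U = np.zeros((NUM_B, NUM_A))
    V = np.zeros((K, NUM_A))
    escape = None
    for step in range(steps):
        b = rng.integers(NUM_B, size=batch_size)
        z = rng.integers(K, size=batch_size)
        a = candidates[b, z]
        logits = U[b] + V[z]
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        loss = -np.log(p[np.arange(batch_size), a]).mean()
        if escape is None and loss < 0.5 * np.log(K):
            escape = step  # first step below half the plateau height
        # Softmax cross-entropy gradient, scattered into the tables.
        g = p
        g[np.arange(batch_size), a] -= 1.0
        g /= batch_size
        np.add.at(U, b, -lr * g)
        np.add.at(V, z, -lr * g)
    # Routing probe: how strongly do the z-rows of V separate? A flat V
    # means the model is still sitting on the z-blind marginal solution.
    routing = np.abs(V - V.mean(axis=0)).mean()
    return escape, routing

for bs in (4, 32, 256):
    escape, routing = train(bs)
    print(f"batch={bs:4d}  escape_step={escape}  routing_strength={routing:.3f}")
```

Because the candidate sets are disjoint across (b, z) pairs, this additive model can represent both the z-blind marginal (flat V) and the exact conditional solution, which is what makes a probe on V a sensible stand-in for the paper's selector-routing head.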