The Power of Power Law: Asymmetry Enables Compositional Reasoning

arXiv cs.AI / 4/28/2026


Key Points

  • The paper argues that natural-language knowledge and skills follow a power-law distribution, and that, contrary to common intuition, training on power-law-sampled data can outperform training on uniformly sampled data for compositional reasoning tasks (the two sampling schemes are contrasted in the sketch after this list).
  • The reported gains span multiple compositional reasoning settings, including state tracking and multi-step arithmetic, where the model must combine skills across steps.
  • The authors introduce a simplified skill-composition benchmark and prove that power-law training requires substantially less data than uniform training to learn the task.
  • The analysis attributes the advantage to “beneficial asymmetry” from power-law sampling, which improves the loss landscape and helps models first learn frequent skill compositions before efficiently tackling rare long-tail skills.
  • Overall, the work reframes how to choose training data distributions for compositional reasoning, suggesting that non-uniform (power-law) sampling may be inherently more effective than enforcing uniformity.
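To make the contrast concrete, here is a minimal sketch of the two sampling schemes. The skill-inventory size, the Zipf exponent, and the sample counts are illustrative assumptions, not values from the paper.

```python
# A minimal sketch contrasting power-law and uniform skill sampling.
# All parameters here are illustrative assumptions, not the paper's.
import numpy as np

rng = np.random.default_rng(0)
num_skills = 100       # hypothetical size of the skill inventory
alpha = 1.5            # hypothetical power-law (Zipf) exponent

# Power-law sampling: P(skill k) ∝ 1 / k^alpha, so a few head skills dominate.
ranks = np.arange(1, num_skills + 1)
power_law_probs = ranks.astype(float) ** -alpha
power_law_probs /= power_law_probs.sum()

# Uniform sampling: every skill is equally likely.
uniform_probs = np.full(num_skills, 1.0 / num_skills)

power_law_batch = rng.choice(num_skills, size=10_000, p=power_law_probs)
uniform_batch = rng.choice(num_skills, size=10_000, p=uniform_probs)

# Under the power law, head skills appear orders of magnitude more often
# than tail skills; under uniform sampling, all counts are comparable.
print("power-law head counts:", np.bincount(power_law_batch, minlength=num_skills)[:5])
print("uniform head counts:  ", np.bincount(uniform_batch, minlength=num_skills)[:5])
```

The asymmetry visible in the head counts is exactly what the paper's analysis credits with improving the loss landscape: frequent compositions are learned first and then amortize the cost of the tail.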

Abstract

Natural language data follows a power-law distribution, with most knowledge and skills appearing at very low frequency. While a common intuition suggests that reweighting or curating data towards a uniform distribution may help models better learn these long-tail skills, we find a counterintuitive result: across a wide range of compositional reasoning tasks, such as state tracking and multi-step arithmetic, training under power-law distributions consistently outperforms training under uniform distributions. To understand this advantage, we introduce a minimalist skill-composition task and show that learning under a power-law distribution provably requires significantly less training data. Our theoretical analysis reveals that power-law sampling induces a beneficial asymmetry that improves the pathological loss landscape: models first acquire high-frequency skill compositions with low data complexity, and these compositions then serve as stepping stones for efficiently learning rare long-tailed skills. Our results offer an alternative perspective on what constitutes an effective data distribution for training models.
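The abstract does not spell out the minimalist skill-composition task here. As a hedged illustration of what such a task could look like, the sketch below composes randomly chosen permutation "skills" over a small discrete state space (a toy form of state tracking), with the skill chain drawn under a power law. The task construction and every parameter are our own assumptions, not the paper's actual benchmark.

```python
# A hedged sketch of a toy skill-composition / state-tracking task.
# The construction (composing random permutations) and all parameters
# are illustrative assumptions, not the paper's benchmark.
import numpy as np

rng = np.random.default_rng(0)
num_skills, num_states, depth = 16, 8, 4

# Each "skill" is a fixed permutation over a small discrete state space.
skills = [rng.permutation(num_states) for _ in range(num_skills)]

# Hypothetical power-law (Zipf) distribution over skill ranks.
ranks = np.arange(1, num_skills + 1)
zipf = ranks.astype(float) ** -1.5
zipf /= zipf.sum()

def make_example(probs):
    """Draw a chain of `depth` skills from `probs` and track the final state."""
    chain = rng.choice(num_skills, size=depth, p=probs)
    state = int(rng.integers(num_states))
    final = state
    for skill_id in chain:
        final = int(skills[skill_id][final])   # apply skill to current state
    # Input: (skill chain, start state); target: the composed final state.
    return (chain.tolist(), state), final

# Power-law sampling reuses head skills heavily, so frequent compositions
# appear early and often; uniform sampling spreads probability mass thin.
print(make_example(zipf))
```

Under this toy setup, a model trained on power-law-sampled chains would see head-skill compositions far more often, matching the paper's claim that frequent compositions are acquired first and then bootstrap learning of the rare tail.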