Learning is Forgetting: LLM Training As Lossy Compression

arXiv cs.CL / 4/10/2026


Key Points

  • The paper proposes viewing large language models as performing lossy compression: as training proceeds, they retain only information from the training data that is relevant to their objectives.
  • It argues that representations learned during pre-training can be interpreted through an information-theoretic lens, with results approaching the Information Bottleneck bound for compression in next-sequence prediction.
  • Experiments across multiple open-weight LLMs show that different model families compress their knowledge differently, likely reflecting variations in data and training recipes.
  • The authors claim the degree of compression/optimality of a model correlates with the amount of information captured and can predict downstream performance across many benchmarks, linking representational structure to practical outcomes.
  • The work provides a unified information-theoretic framing intended to be usable at scale for understanding how LLMs learn and for deriving actionable insights about model performance.

Abstract

Despite the increasing prevalence of large language models (LLMs), we still have a limited understanding of how their representational spaces are structured. This limits our ability to interpret how and what they learn, or to relate their learning to learning in humans. We argue LLMs are best seen as an instance of lossy compression: over training, they learn by retaining only the information in their training data that is relevant to their objective(s). We show pre-training results in models that are optimally compressed for next-sequence prediction, approaching the Information Bottleneck bound on compression. Across an array of open-weight models, each compresses differently, likely due to differences in the data and training recipes used. However, even across different families of LLMs, the optimality of a model's compression, and the information present in it, can predict downstream performance across a wide array of benchmarks, letting us directly link representational structure to actionable insights about model performance. In the general case, the work presented here offers a unified information-theoretic framing for how these models learn that is deployable at scale.
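
For readers unfamiliar with the bound the abstract invokes: the Information Bottleneck (in its standard Tishby-style formulation, not taken from this paper) seeks a representation $T$ of an input $X$ that is maximally compressed while preserving information about a target $Y$ (here, the next sequence). The trade-off is written as

$$
\min_{p(t \mid x)} \; I(X; T) - \beta \, I(T; Y),
$$

where $I(\cdot\,;\cdot)$ denotes mutual information and $\beta > 0$ controls how much predictive information is retained relative to compression. A model "approaching the Information Bottleneck bound" is one whose representations discard nearly all information in $X$ that is irrelevant to predicting $Y$. How the paper estimates these quantities for LLM representations at scale is described in the full text.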