Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts
Apple Machine Learning Journal / 4/13/2026
Key Points
- The paper argues that pruning training data can improve a model’s ability to memorize specific facts, making “less data” yield better factual retention.
- It focuses on how selectively removing parts of the training set affects memorization behavior, rather than only overall generalization.
- The work is presented as an ICLR workshop paper and contributes evidence for training-data curation as a lever for controllable memorization of factual content.
- The authors frame the approach around efficiency and data selection, suggesting practical ways to influence what information models store internally.
- The findings have implications for how datasets are cleaned and curated when the goal is reliable fact recall or minimizing the influence of irrelevant data.
This paper was accepted at the Workshop on Navigating and Addressing Data Problems for Foundation Models at ICLR 2026.
Large language models (LLMs) can struggle to memorize factual knowledge in their parameters, often leading to hallucinations and poor performance on knowledge-intensive tasks. In this paper, we formalize fact memorization from an information-theoretic perspective and study how training data distributions affect fact accuracy. We show that fact accuracy is suboptimal (below the capacity limit) whenever the amount of information contained in the training data facts exceeds model…
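The capacity framing in the abstract can be made concrete with a back-of-the-envelope comparison. The sketch below is not from the paper; it assumes fact attributes are uniformly distributed, and it uses an illustrative fixed storage budget per parameter (the 2 bits/parameter figure is an assumption, not a result reported here) to show when the information demanded by a fact set exceeds what a model of a given size could store.

```python
# Illustrative sketch (not from the paper): compare the total information
# carried by a set of facts against an assumed parametric storage budget,
# to see when fact accuracy must fall below the capacity limit.
import math

def fact_information_bits(num_entities: int, values_per_attribute: list[int]) -> float:
    """Bits needed to store one value per attribute per entity,
    assuming attribute values are uniformly distributed (an assumption)."""
    bits_per_entity = sum(math.log2(v) for v in values_per_attribute)
    return num_entities * bits_per_entity

def accuracy_ceiling(total_fact_bits: float,
                     num_params: float,
                     bits_per_param: float = 2.0) -> float:
    """Crude ceiling on fact accuracy: if the facts carry more bits than the
    model can store, at most a capacity/demand fraction can be memorized.
    bits_per_param = 2.0 is an illustrative estimate, not the paper's value."""
    capacity_bits = num_params * bits_per_param
    return min(1.0, capacity_bits / total_fact_bits)

if __name__ == "__main__":
    # Hypothetical fact set: 10M entities, each with three attributes drawn
    # from vocabularies of size 1,000, 36,600, and 200 (all made-up numbers).
    demand = fact_information_bits(10_000_000, [1_000, 36_600, 200])
    for n_params in (1e8, 1e9, 1e10):
        print(f"{n_params:.0e} params -> accuracy ceiling "
              f"{accuracy_ceiling(demand, n_params):.2f}")
```

Under these assumptions, the smaller models sit well below a ceiling of 1.0, which mirrors the abstract's point that accuracy becomes suboptimal once the information in the training facts exceeds what the model can hold, and motivates pruning the data so the facts that remain fit within capacity.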