Model-agnostic information transfer and fusion for classification with label noise

arXiv stat.ML / 4/29/2026


Key Points

  • The paper tackles learning from datasets with label noise by using a common “coarse noisy labels + small clean expert-labeled set” paradigm, framing it as an information transfer and fusion problem.
  • It argues that existing statistical transfer learning methods break down because of a substantial distribution shift between noisy and clean data and because they assume parametric models that are a poor fit for complex inputs like images.
  • The authors propose a model-agnostic, nonparametric classification framework that can work across a broad class of classifiers rather than being tied to specific model architectures.
  • The method uses the small clean dataset to “purify” the larger noisy dataset while explicitly handling the remaining ambiguous samples, backed by a rigorous statistical theory.
  • Experiments include simulations and a medical imaging case study for pneumonia diagnosis, showing practical effectiveness of the framework.
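The "purify" step described above can be illustrated with a toy sketch. The paper does not specify the exact rule, so everything here is an illustrative assumption: a 1-nearest-neighbour classifier stands in for the model fit on the small clean set, and a simple agreement check splits the noisy data into confirmed and ambiguous samples.

```python
# Hypothetical sketch of the "purify" idea: screen the large noisy set with a
# model fit on the small clean set. The 1-NN classifier and the agreement rule
# are illustrative assumptions, not the paper's actual procedure.

def knn_label(x, clean_x, clean_y):
    """Label x by its nearest clean neighbour (1-NN, squared distance)."""
    best = min(range(len(clean_x)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(x, clean_x[i])))
    return clean_y[best]

def purify(noisy, clean_x, clean_y):
    """Split noisy (x, y) pairs into 'kept' (noisy label confirmed by the
    clean model) and 'ambiguous' (disagreement, to be handled separately)."""
    kept, ambiguous = [], []
    for x, y in noisy:
        (kept if knn_label(x, clean_x, clean_y) == y else ambiguous).append((x, y))
    return kept, ambiguous

# Tiny demo: two clean points, three noisy points, one with a wrong label.
clean_x = [(0.0, 0.0), (1.0, 1.0)]
clean_y = [0, 1]
noisy = [((0.1, 0.1), 0),   # agrees with clean model -> kept
         ((0.9, 0.8), 0),   # contradicts clean model -> ambiguous
         ((1.1, 1.0), 1)]   # agrees -> kept
kept, ambiguous = purify(noisy, clean_x, clean_y)
print(len(kept), len(ambiguous))  # 2 1
```

In the paper's actual framework the ambiguous set is not discarded but managed explicitly, which is where the statistical theory comes in; this sketch only shows the coarse split.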

Abstract

Label noise presents a fundamental challenge in modern machine learning, especially when large-scale datasets are generated via automated processes. An increasingly common and important data paradigm, particularly in domains like medical imaging, involves learning from a large dataset with coarse, noisy labels supplemented by a small, expert-verified, clean dataset. This setting constitutes a typical information transfer and fusion problem. However, the significant distribution shift between the noisy and clean data violates the core overall parametric similarity assumptions of existing statistical transfer learning methods, while their reliance on parametric models is ill-suited for complex data like images. To address these limitations, this paper develops a generic model-agnostic nonparametric framework for classification with label noise, which applies to a broad class of classifiers. Our approach leverages the small clean dataset to "purify" the large noisy one and carefully manages the remaining ambiguous samples. This framework is underpinned by a rigorous statistical theory. Its empirical performance is demonstrated through simulations and a real-world application to medical image analysis for pneumonia diagnosis.