Quantifying Data Similarity Using Cross Learning

arXiv stat.ML / 4/22/2026


Key Points

  • The paper argues that existing dataset-similarity methods often ignore label information and alignment between features and responses, which can limit transfer learning and domain adaptation.
  • It proposes the Cross-Learning Score (CLS), a similarity metric based on bidirectional generalization performance of decision rules rather than only input feature distributions.
  • The authors provide a theoretical and geometric interpretation by relating CLS to cosine similarity between decision boundaries in canonical linear models.
  • They develop a robust, ensemble-based estimator that avoids high-dimensional density estimation, and they extend the idea to deep learning using encoder-head architectures.
  • For transfer learning, the paper introduces a “transferable zones” framework that divides source datasets into positive, ambiguous, and negative transfer regions, validated via extensive experiments.
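The core idea in the second bullet can be sketched in a few lines. This is an illustration, not the paper's exact CLS estimator: it uses a toy nearest-centroid decision rule and simple synthetic Gaussian data (`make_dataset`, `fit_rule`, and the averaging scheme are all assumptions for demonstration), but it captures the bidirectional-generalization principle: train a rule on each dataset, score it on the other, and combine the two directions.

```python
# Toy sketch of a cross-learning similarity score (illustrative only; the
# paper's CLS definition and ensemble estimator differ in detail).
import numpy as np

rng = np.random.default_rng(0)

def make_dataset(angle, n=500):
    # Two Gaussian classes separated along a direction at `angle` radians;
    # the angle controls the orientation of the optimal decision boundary.
    direction = np.array([np.cos(angle), np.sin(angle)])
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 2)) + np.outer(2 * y - 1, 2.0 * direction)
    return X, y

def fit_rule(X, y):
    # A minimal linear decision rule: nearest class centroid.
    return X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)

def accuracy(rule, X, y):
    c0, c1 = rule
    pred = (np.linalg.norm(X - c1, axis=1)
            < np.linalg.norm(X - c0, axis=1)).astype(int)
    return float((pred == y).mean())

def cross_learning_score(data_a, data_b):
    # Bidirectional generalization: how well does each dataset's decision
    # rule perform on the other dataset? Average the two directions.
    (Xa, ya), (Xb, yb) = data_a, data_b
    return 0.5 * (accuracy(fit_rule(Xa, ya), Xb, yb)
                  + accuracy(fit_rule(Xb, yb), Xa, ya))

similar = cross_learning_score(make_dataset(0.0), make_dataset(0.1))
dissimilar = cross_learning_score(make_dataset(0.0), make_dataset(np.pi))
print(f"similar pair: {similar:.2f}, dissimilar pair: {dissimilar:.2f}")
```

Datasets whose decision boundaries nearly agree yield a high score, while a pair whose labels are oriented oppositely scores near zero, even though the input feature distributions are almost identical in both cases, which is exactly the label information that feature-distribution metrics discard.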

Abstract

Measuring dataset similarity is fundamental in machine learning, particularly for transfer learning and domain adaptation. In the context of supervised learning, most existing approaches quantify similarity of two data sets based on their input feature distributions, neglecting label information and feature-response alignment. To address this, we propose the Cross-Learning Score (CLS), which measures dataset similarity through bidirectional generalization performance of decision rules. We establish its theoretical foundation by linking CLS to cosine similarity between decision boundaries under canonical linear models, providing a geometric interpretation. A robust ensemble-based estimator is developed that is easy to implement and bypasses high-dimensional density estimation entirely. For transfer learning applications, we introduce a "transferable zones" framework that categorizes source datasets into positive, ambiguous, and negative transfer regions. To accommodate deep learning, we extend CLS to encoder-head architectures, aligning with modern representation-based pipelines. Extensive experiments on synthetic and real-world datasets validate the effectiveness of CLS for similarity measurement and transfer assessment.
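The geometric interpretation mentioned in the abstract can be made concrete with a small numerical check. This is a sketch of the intuition, not the paper's theorem: under a nearest-centroid linear rule (an assumption here), the decision-boundary normal is the difference of class means, and the cosine between two datasets' normals tracks how far apart their boundary orientations are.

```python
# Illustration of the cosine-between-decision-boundaries intuition for
# linear rules (not the paper's formal result).
import numpy as np

rng = np.random.default_rng(1)

def boundary_normal(angle, n=2000):
    # Two Gaussian classes separated along `angle`; for a nearest-centroid
    # rule the boundary normal is the difference of class means.
    direction = np.array([np.cos(angle), np.sin(angle)])
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 2)) + np.outer(2 * y - 1, 2.0 * direction)
    return X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Nearly aligned boundaries give cosine near 1; orthogonal ones near 0.
aligned = cosine(boundary_normal(0.0), boundary_normal(0.1))
orthogonal = cosine(boundary_normal(0.0), boundary_normal(np.pi / 2))
print(f"aligned: {aligned:.3f}, orthogonal: {orthogonal:.3f}")
```

The appeal of linking CLS to this quantity is that cosine similarity of boundary normals is a purely geometric object, so it gives the score an interpretation independent of any density estimate over the inputs.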