Quantifying Data Similarity Using Cross Learning
arXiv stat.ML / 4/22/2026
Opinion | Ideas & Deep Analysis | Models & Research
Key Points
- The paper argues that existing dataset-similarity methods often ignore label information and the alignment between features and responses, which can limit their usefulness for transfer learning and domain adaptation.
- It proposes the Cross-Learning Score (CLS), a similarity metric based on bidirectional generalization performance of decision rules rather than only input feature distributions.
- The authors provide a theoretical and geometric interpretation by relating CLS to cosine similarity between decision boundaries in canonical linear models.
- They develop a robust, ensemble-based estimator that avoids high-dimensional density estimation, and they extend the idea to deep learning using encoder-head architectures.
- For transfer learning, the paper introduces a “transferable zones” framework that divides source datasets into positive, ambiguous, and negative transfer regions, and it reports extensive experiments as validation.
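To make the bidirectional idea in the points above concrete, here is a minimal sketch under stated assumptions. It is not the paper's CLS definition or its ensemble estimator: the score below is simply the average of "train on A, test on B" and "train on B, test on A" accuracy for a least-squares linear rule. The function names (`fit_linear_rule`, `cross_learning_sketch`, `boundary_cosine`, `make_dataset`) and the synthetic data are hypothetical, chosen only for illustration. The example builds two datasets with the same input distribution but flipped labels, a case that a feature-only similarity measure would treat as identical.

```python
import numpy as np


def fit_linear_rule(X, y):
    """Least-squares linear classifier for labels in {-1, +1}.

    Returns the weight vector with the bias as its last entry.
    (Illustrative stand-in; the paper's estimator is not specified here.)
    """
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    return w


def accuracy(w, X, y):
    """Fraction of points whose predicted sign matches the label."""
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    return float(np.mean(np.sign(X_aug @ w) == y))


def cross_learning_sketch(Xa, ya, Xb, yb):
    """Assumed bidirectional score: mean of A->B and B->A accuracy."""
    wa = fit_linear_rule(Xa, ya)
    wb = fit_linear_rule(Xb, yb)
    return 0.5 * (accuracy(wa, Xb, yb) + accuracy(wb, Xa, ya))


def boundary_cosine(Xa, ya, Xb, yb):
    """Cosine similarity between the two fitted normal vectors (bias dropped)."""
    wa = fit_linear_rule(Xa, ya)[:-1]
    wb = fit_linear_rule(Xb, yb)[:-1]
    return float(wa @ wb / (np.linalg.norm(wa) * np.linalg.norm(wb)))


def make_dataset(rng, direction, n=500):
    """Standard-normal inputs labeled by the side of a hyperplane."""
    X = rng.normal(size=(n, 2))
    y = np.where(X @ direction >= 0, 1.0, -1.0)
    return X, y


rng = np.random.default_rng(0)
A = make_dataset(rng, np.array([1.0, 0.0]))
B_same = make_dataset(rng, np.array([1.0, 0.1]))   # nearly the same rule
B_flip = make_dataset(rng, np.array([-1.0, 0.0]))  # same inputs, flipped labels

score_same = cross_learning_sketch(*A, *B_same)
score_flip = cross_learning_sketch(*A, *B_flip)
cos_same = boundary_cosine(*A, *B_same)
cos_flip = boundary_cosine(*A, *B_flip)

print(f"aligned rules: score={score_same:.3f}, cosine={cos_same:.3f}")
print(f"flipped rules: score={score_flip:.3f}, cosine={cos_flip:.3f}")
```

In this toy setting the score and the cosine between fitted boundaries move together: both are high when the labeling rules agree and low when they are reversed, even though the input distributions are identical. That matches the summary's claim about the linear case in spirit only; the exact relationship the authors derive is not reproduced here.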


