Quantifying Data Similarity Using Cross Learning

arXiv stat.ML / 4/22/2026


Key Points

  • The paper argues that existing dataset-similarity methods often ignore label information and alignment between features and responses, which can limit transfer learning and domain adaptation.
  • It proposes the Cross-Learning Score (CLS), a similarity metric based on bidirectional generalization performance of decision rules rather than only input feature distributions.
  • The authors provide a theoretical and geometric interpretation by relating CLS to cosine similarity between decision boundaries in canonical linear models.
  • They develop a robust, ensemble-based estimator that avoids high-dimensional density estimation, and they extend the idea to deep learning using encoder-head architectures.
  • For transfer learning, the paper introduces a “transferable zones” framework that divides source datasets into positive, ambiguous, and negative transfer regions, validated via extensive experiments.
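The core idea in the second bullet can be sketched in a few lines. This is an illustration, not the paper's exact CLS estimator: it uses a toy nearest-centroid decision rule and simple synthetic Gaussian data (`make_dataset`, `fit_rule`, and the averaging scheme are all assumptions for demonstration), but it captures the bidirectional-generalization principle: train a rule on each dataset, score it on the other, and combine the two directions.

```python
# Toy sketch of a cross-learning similarity score (illustrative only; the
# paper's CLS definition and ensemble estimator differ in detail).
import numpy as np

rng = np.random.default_rng(0)

def make_dataset(angle, n=500):
    # Two Gaussian classes separated along a direction at `angle` radians;
    # the angle controls the orientation of the optimal decision boundary.
    direction = np.array([np.cos(angle), np.sin(angle)])
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 2)) + np.outer(2 * y - 1, 2.0 * direction)
    return X, y

def fit_rule(X, y):
    # A minimal linear decision rule: nearest class centroid.
    return X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)

def accuracy(rule, X, y):
    c0, c1 = rule
    pred = (np.linalg.norm(X - c1, axis=1)
            < np.linalg.norm(X - c0, axis=1)).astype(int)
    return float((pred == y).mean())

def cross_learning_score(data_a, data_b):
    # Bidirectional generalization: how well does each dataset's decision
    # rule perform on the other dataset? Average the two directions.
    (Xa, ya), (Xb, yb) = data_a, data_b
    return 0.5 * (accuracy(fit_rule(Xa, ya), Xb, yb)
                  + accuracy(fit_rule(Xb, yb), Xa, ya))

similar = cross_learning_score(make_dataset(0.0), make_dataset(0.1))
dissimilar = cross_learning_score(make_dataset(0.0), make_dataset(np.pi))
print(f"similar pair: {similar:.2f}, dissimilar pair: {dissimilar:.2f}")
```

Datasets whose decision boundaries nearly agree yield a high score, while a pair whose labels are oriented oppositely scores near zero, even though the input feature distributions are almost identical in both cases, which is exactly the label information that feature-distribution metrics discard.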

Abstract

Measuring dataset similarity is fundamental in machine learning, particularly for transfer learning and domain adaptation. In the context of supervised learning, most existing approaches quantify similarity of two data sets based on their input feature distributions, neglecting label information and feature-response alignment. To address this, we propose the Cross-Learning Score (CLS), which measures dataset similarity through bidirectional generalization performance of decision rules. We establish its theoretical foundation by linking CLS to cosine similarity between decision boundaries under canonical linear models, providing a geometric interpretation. A robust ensemble-based estimator is developed that is easy to implement and bypasses high-dimensional density estimation entirely. For transfer learning applications, we introduce a "transferable zones" framework that categorizes source datasets into positive, ambiguous, and negative transfer regions. To accommodate deep learning, we extend CLS to encoder-head architectures, aligning with modern representation-based pipelines. Extensive experiments on synthetic and real-world datasets validate the effectiveness of CLS for similarity measurement and transfer assessment.
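The geometric interpretation mentioned in the abstract can be made concrete with a small numerical check. This is a sketch of the intuition, not the paper's theorem: under a nearest-centroid linear rule (an assumption here), the decision-boundary normal is the difference of class means, and the cosine between two datasets' normals tracks how far apart their boundary orientations are.

```python
# Illustration of the cosine-between-decision-boundaries intuition for
# linear rules (not the paper's formal result).
import numpy as np

rng = np.random.default_rng(1)

def boundary_normal(angle, n=2000):
    # Two Gaussian classes separated along `angle`; for a nearest-centroid
    # rule the boundary normal is the difference of class means.
    direction = np.array([np.cos(angle), np.sin(angle)])
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 2)) + np.outer(2 * y - 1, 2.0 * direction)
    return X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Nearly aligned boundaries give cosine near 1; orthogonal ones near 0.
aligned = cosine(boundary_normal(0.0), boundary_normal(0.1))
orthogonal = cosine(boundary_normal(0.0), boundary_normal(np.pi / 2))
print(f"aligned: {aligned:.3f}, orthogonal: {orthogonal:.3f}")
```

The appeal of linking CLS to this quantity is that cosine similarity of boundary normals is a purely geometric object, so it gives the score an interpretation independent of any density estimate over the inputs.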