UniPROT: Uniform Prototype Selection via Partial Optimal Transport with Submodular Guarantees

arXiv cs.LG / 4/14/2026


Key Points

  • Proposes UniPROT (Uniform Prototype Selection via Partial Optimal Transport), a framework that selects a prototype set from a source distribution by minimizing the optimal transport (OT) distance between a uniformly weighted prototype distribution and the target distribution.
  • The intuitive formulation leads to a cardinality-constrained maximization of a super-additive objective, which is generally hard to approximate; the authors reformulate the OT marginal constraints to obtain a partial-OT-based submodular objective.
  • This reformulation provably lets a greedy algorithm achieve a (1 - 1/e) approximation guarantee relative to the original problem.
  • Experiments show that the method improves minority-class representation on imbalanced data without sacrificing majority-class accuracy, and report robust performance gains in LLM pretraining/finetuning under domain imbalance via uniform source contributions.
  • Code is publicly available on GitHub, positioning UniPROT as a scalable prototype selection method with theoretical guarantees.
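The greedy-with-submodular-guarantee recipe in the bullets above can be sketched with a toy example. The snippet below uses a facility-location objective as a stand-in surrogate (the paper's actual objective is the partial-OT reformulation, which requires an OT solver); all names and the similarity function are ours, chosen only to illustrate why greedy marginal-gain selection carries a (1 - 1/e) guarantee for monotone submodular objectives.

```python
def facility_location(selected, targets, sim):
    # f(S) = sum over target points of their best similarity to any
    # selected prototype; a classic monotone submodular objective.
    if not selected:
        return 0.0
    return sum(max(sim(s, t) for s in selected) for t in targets)

def greedy_select(candidates, targets, k, sim):
    # Standard greedy: at each step add the candidate with the largest
    # marginal gain. For monotone submodular f this achieves at least
    # (1 - 1/e) of the optimal value under the cardinality constraint |S| = k.
    selected = []
    for _ in range(k):
        base = facility_location(selected, targets, sim)
        best, best_gain = None, -1.0
        for c in candidates:
            if c in selected:
                continue
            gain = facility_location(selected + [c], targets, sim) - base
            if gain > best_gain:
                best, best_gain = c, gain
        selected.append(best)
    return selected

# Toy usage: three clusters of 1D points; greedy picks one prototype per cluster.
sim = lambda a, b: 1.0 / (1.0 + abs(a - b))
prototypes = greedy_select([0.0, 0.1, 5.0, 5.1, 10.0], [0.05, 5.05, 10.0], 3, sim)
```

Each greedy step costs one objective evaluation per remaining candidate; lazy evaluation or stochastic greedy variants are the usual way to scale this loop to large candidate pools.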

Abstract

Selecting prototypical examples from a source distribution to represent a target data distribution is a fundamental problem in machine learning. Existing subset selection methods often rely on implicit importance scores, which can be skewed towards majority classes and lead to low-quality prototypes for minority classes. We present UniPROT, a novel subset selection framework that minimizes the optimal transport (OT) distance between a uniformly weighted prototypical distribution and the target distribution. While intuitive, this formulation leads to a cardinality-constrained maximization of a super-additive objective, which is generally intractable to approximate efficiently. To address this, we propose a principled reformulation of the OT marginal constraints, yielding a partial optimal transport-based submodular objective. We prove that this reformulation enables a greedy algorithm with a (1-1/e) approximation guarantee relative to the original super-additive maximization problem. Empirically, we showcase that enforcing uniform prototype weights in UniPROT consistently improves minority-class representation in imbalanced classification benchmarks without compromising majority-class accuracy. In both finetuning and pretraining regimes for large language models under domain imbalance, UniPROT enforces uniform source contributions, yielding robust performance gains. Our results establish UniPROT as a scalable, theoretically grounded solution for uniform-weighted prototype selection. Our code is publicly available at GitHub: https://github.com/efficiency-learning/UniPROT
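The selection problem the abstract describes can be written schematically as follows (the notation is ours, not taken from the paper): given a source dataset $\mathcal{D}$, a target distribution $\nu$, and a budget $k$, choose

$$
S^\star \in \operatorname*{arg\,min}_{S \subseteq \mathcal{D},\; |S| = k} \; \mathrm{OT}\!\left( \frac{1}{k} \sum_{x \in S} \delta_x,\; \nu \right),
$$

where $\delta_x$ is a Dirac mass at $x$, so the first argument is the uniformly weighted prototype distribution. The equal $1/k$ weights are what distinguish this from importance-score methods: no selected prototype can be down-weighted, which is why minority-class coverage is preserved.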