Gaussian Joint Embeddings For Self-Supervised Representation Learning

arXiv cs.LG / 3/31/2026

💬 OpinionSignals & Early TrendsIdeas & Deep AnalysisModels & Research

共有:

Key Points

本論文は、自己教師あり表現学習で文脈・目標の潜在表現を合わせる際に、決定論的な予測（回帰）に代わる確率的な生成ジョイントモデリングとしてGaussian Joint Embeddings (GJE) とGaussian Mixture Joint Embeddings (GMJE) を提案している。
GJE/GMJEは文脈と目標表現の同時密度をモデル化し、ブラックボックス予測ではなく明示的な確率モデルに基づく閉形式の条件付き推論で学習・推論を行うことで、潜在幾何を制御する共分散を考慮した目的関数と不確実性推定を可能にする。
著者らは経験的なバッチ最適化で起こる失敗モード「Mahalanobis Trace Trap」を特定し、それに対する複数の対策（プロトタイプベースGMJE、GMJE-MDN、GMJE-GNG、SMCメモリバンクなど）を提示している。
標準的なコントラスト学習はGMJEの退化した非パラメトリック極限として解釈できることを示し、既存手法との理論的な接続を与えている。
合成のマルチモーダル整合タスクと視覚ベンチマークの実験では、GMJEが複雑な条件構造の復元、競争的な識別表現の学習、より良い潜在密度（無条件サンプリング適性）を示したとしている。

Abstract

Self-supervised representation learning often relies on deterministic predictive architectures to align context and target views in latent space. While effective in many settings, such methods are limited in genuinely multi-modal inverse problems, where squared-loss prediction collapses towards conditional averages, and they frequently depend on architectural asymmetries to prevent representation collapse. In this work, we propose a probabilistic alternative based on generative joint modeling. We introduce Gaussian Joint Embeddings (GJE) and its multi-modal extension, Gaussian Mixture Joint Embeddings (GMJE), which model the joint density of context and target representations and replace black-box prediction with closed-form conditional inference under an explicit probabilistic model. This yields principled uncertainty estimates and a covariance-aware objective for controlling latent geometry. We further identify a failure mode of naive empirical batch optimization, which we term the Mahalanobis Trace Trap, and develop several remedies spanning parametric, adaptive, and non-parametric settings, including prototype-based GMJE, conditional Mixture Density Networks (GMJE-MDN), topology-adaptive Growing Neural Gas (GMJE-GNG), and a Sequential Monte Carlo (SMC) memory bank. In addition, we show that standard contrastive learning can be interpreted as a degenerate non-parametric limiting case of the GMJE framework. Experiments on synthetic multi-modal alignment tasks and vision benchmarks show that GMJE recovers complex conditional structure, learns competitive discriminative representations, and defines latent densities that are better suited to unconditional sampling than deterministic or unimodal baselines.