Joint Embedding Variational Bayes

arXiv stat.ML · March 31, 2026


Key Points

  • The paper introduces Variational Joint Embedding (VJE), a reconstruction-free, non-contrastive self-supervised learning framework that uses a latent-variable variational formulation in representation space.
  • VJE maximizes a symmetric conditional ELBO by defining a likelihood directly on target embeddings, avoiding pointwise compatibility objectives and enabling probabilistic semantics in learned representations.
  • The conditional likelihood is modeled with a heavy-tailed Student-t distribution over a polar representation of target embeddings, using directional–radial decomposition to separate angular alignment from magnitude consistency and reduce norm-related issues.
  • An amortized inference network produces a diagonal Gaussian posterior with uncertainty that is tied to the directional likelihood’s feature-wise variances, yielding anisotropic uncertainty without additional projection heads.
  • Experiments on ImageNet-1K, CIFAR-10/100, and STL-10 show VJE is competitive on linear and k-NN evaluations and improves out-of-distribution detection using representation-space likelihoods.
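The directional–radial decomposition and heavy-tailed likelihood described above can be sketched numerically. The exact parameterization belongs to the paper; the helper names, the feature-wise independence, and the choice of a Student-t factor on the log-radius below are illustrative assumptions, not the authors' implementation.

```python
import math
import numpy as np

def polar_decompose(y):
    """Split an embedding y into a unit direction u and a radial norm r."""
    r = float(np.linalg.norm(y))
    return y / r, r

def student_t_logpdf(x, loc, scale, df):
    """Feature-wise log-density of an independent Student-t (heavy-tailed)."""
    z = (x - loc) / scale
    return (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
            - 0.5 * math.log(df * math.pi) - np.log(scale)
            - 0.5 * (df + 1) * np.log1p(z ** 2 / df))

def polar_loglik(y_target, mu_dir, sigma_dir, mu_logr, sigma_r, df=4.0):
    """Directional factor on the unit sphere plus a radial factor on log ||y||.

    mu_dir / sigma_dir play the role of a predicted mean direction and
    feature-wise scales (anisotropic; in the paper these variances are
    shared with the amortized posterior).
    """
    u, r = polar_decompose(y_target)
    log_dir = float(np.sum(student_t_logpdf(u, mu_dir, sigma_dir, df)))
    log_rad = float(student_t_logpdf(np.array([math.log(r)]),
                                     mu_logr, sigma_r, df)[0])
    return log_dir + log_rad
```

Separating the two factors means a target that points the right way but has an unusual norm is penalized only through the radial term, which is the norm-pathology mitigation the key points refer to.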

Abstract

We introduce Variational Joint Embedding (VJE), a reconstruction-free latent-variable framework for non-contrastive self-supervised learning in representation space. VJE maximizes a symmetric conditional evidence lower bound (ELBO) on paired encoder embeddings by defining a conditional likelihood directly on target representations, rather than optimizing a pointwise compatibility objective. The likelihood is instantiated as a heavy-tailed Student-\(t\) distribution on a polar representation of the target embedding, where a directional–radial decomposition separates angular agreement from magnitude consistency and mitigates norm-induced pathologies. The directional factor operates on the unit sphere, yielding a valid variational bound for the associated spherical subdensity model. An amortized inference network parameterizes a diagonal Gaussian posterior whose feature-wise variances are shared with the directional likelihood, yielding anisotropic uncertainty without auxiliary projection heads. Across ImageNet-1K, CIFAR-10/100, and STL-10, VJE is competitive with standard non-contrastive baselines under linear and \(k\)-NN evaluation, while providing probabilistic semantics directly in representation space for downstream uncertainty-aware applications. We validate these semantics through out-of-distribution detection, where representation-space likelihoods yield strong empirical performance. These results position the framework as a principled variational formulation of non-contrastive learning, in which structured feature-wise uncertainty is represented directly in the learned embedding space.
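As a minimal sketch of the symmetric conditional ELBO, the snippet below pairs a reparameterized diagonal-Gaussian posterior with a likelihood on the target embedding and averages the bound over both view orderings. The Gaussian likelihood, the identity "decoder", and all function names here are illustrative stand-ins: the paper's actual likelihood is the heavy-tailed polar Student-t, and its posterior and conditional mean come from learned networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_to_std_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return 0.5 * np.sum(mu ** 2 + np.exp(logvar) - logvar - 1.0)

def gauss_loglik(y, mean, logvar):
    """Diagonal-Gaussian log-likelihood of a target embedding y."""
    return -0.5 * np.sum(logvar + (y - mean) ** 2 / np.exp(logvar)
                         + np.log(2.0 * np.pi))

def conditional_elbo(mu_q, logvar_q, y_target, n_samples=16):
    """MC estimate of E_q[log p(y_target | z)] - KL(q || p).

    The conditional mean of y_target is taken to be z itself (an
    identity decoder), purely for illustration.
    """
    eps = rng.standard_normal((n_samples, mu_q.size))
    z = mu_q + np.exp(0.5 * logvar_q) * eps          # reparameterization trick
    recon = np.mean([gauss_loglik(y_target, zi, np.zeros_like(zi))
                     for zi in z])
    return recon - kl_to_std_normal(mu_q, logvar_q)

def symmetric_elbo(mu_a, lv_a, y_a, mu_b, lv_b, y_b):
    """Symmetrize: view A's posterior explains view B's embedding, and vice versa."""
    return 0.5 * (conditional_elbo(mu_a, lv_a, y_b)
                  + conditional_elbo(mu_b, lv_b, y_a))
```

The symmetrization is what makes the objective reconstruction-free in spirit: neither view is decoded back to pixels; each view's posterior only has to explain the other view's embedding.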