Shape-of-You: Fused Gromov-Wasserstein Optimal Transport for Semantic Correspondence in-the-Wild

arXiv cs.CV / 3/13/2026

Key Points

  • The paper reformulates pseudo-label generation as a fused Gromov-Wasserstein (FGW) problem to jointly optimize inter-feature similarity and intra-structural consistency for unsupervised semantic correspondence in the wild.
  • Shape-of-You (SoY) uses a 3D foundation model to define the intra-structure in geometric space, addressing ambiguities from symmetry and repetitive features that 2D appearance alone cannot resolve.
  • Because the FGW objective is quadratic in the transport plan and computationally prohibitive, the authors approximate it with anchor-based linearization, yielding a probabilistic transport plan that serves as a structurally consistent but noisy supervisory signal.
  • A soft-target loss dynamically blends guidance from the transport plan with the network's own predictions, making training robust to this noise in the absence of correspondence annotations.
  • SoY achieves state-of-the-art results on SPair-71k and AP-10k benchmarks and provides code at Shape-of-You.
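
For readers unfamiliar with FGW, the objective referred to in the points above can be written in its standard textbook form. This is the general formulation from the optimal transport literature, not necessarily the paper's exact notation; the symbols M, C^(1), C^(2), α, p, and q are illustrative assumptions.

```latex
\min_{\pi \in \Pi(p, q)} \;
(1 - \alpha) \sum_{i,j} M_{ij}\, \pi_{ij}
\;+\;
\alpha \sum_{i,j,k,l} \bigl| C^{(1)}_{ik} - C^{(2)}_{jl} \bigr|^{2} \, \pi_{ij}\, \pi_{kl}
```

Here M_{ij} is the feature dissimilarity between source location i and target location j (the inter-feature term), C^{(1)} and C^{(2)} are the pairwise structure matrices within each image (the intra-structural term, which SoY defines in 3D geometric space), α trades off the two terms, and Π(p, q) is the set of transport plans with marginals p and q. The second sum is quadratic in π, which is the source of the computational cost noted above.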

Abstract

Semantic correspondence is essential for handling diverse in-the-wild images lacking explicit correspondence annotations. While recent 2D foundation models offer powerful features, adapting them for unsupervised learning via nearest-neighbor pseudo-labels has key limitations: it operates locally, ignoring structural relationships, and consequently its reliance on 2D appearance fails to resolve geometric ambiguities arising from symmetries or repetitive features. In this work, we address this by reformulating pseudo-label generation as a Fused Gromov-Wasserstein (FGW) problem, which jointly optimizes inter-feature similarity and intra-structural consistency. Our framework, Shape-of-You (SoY), leverages a 3D foundation model to define this intra-structure in the geometric space, resolving the aforementioned ambiguities. However, since FGW is a computationally prohibitive quadratic problem, we approximate it through anchor-based linearization. The resulting probabilistic transport plan provides a structurally consistent but noisy supervisory signal. Thus, we introduce a soft-target loss that dynamically blends guidance from this plan with network predictions to build a learning framework robust to this noise. SoY achieves state-of-the-art performance on the SPair-71k and AP-10k datasets, establishing a new benchmark in semantic correspondence without explicit geometric annotations. Code is available at Shape-of-You.
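
The pipeline described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and every function name and parameter here (`anchor_linearized_cost`, `sinkhorn`, `soft_target_loss`, `alpha`, `eps`, `blend`) is hypothetical. It assumes that anchor-based linearization means describing each point by its 3D distances to a few corresponding anchor points, so the quadratic structure term becomes an ordinary pairwise cost. It also assumes an entropic Sinkhorn solver and a fixed scalar blend weight, whereas the paper describes the blending as dynamic.

```python
import numpy as np

def anchor_linearized_cost(feat_src, feat_tgt, xyz_src, xyz_tgt,
                           anchors_src, anchors_tgt, alpha=0.5):
    """Fused cost: feature term plus an anchor-based structure term (assumed form)."""
    # Feature term: cosine distance between L2-normalized descriptors.
    fs = feat_src / np.linalg.norm(feat_src, axis=1, keepdims=True)
    ft = feat_tgt / np.linalg.norm(feat_tgt, axis=1, keepdims=True)
    feat_cost = 1.0 - fs @ ft.T                                    # (n, m)
    # Structure term: each point is described by its 3D distances to K anchors.
    d_src = np.linalg.norm(xyz_src[:, None, :] - xyz_src[anchors_src][None, :, :], axis=2)
    d_tgt = np.linalg.norm(xyz_tgt[:, None, :] - xyz_tgt[anchors_tgt][None, :, :], axis=2)
    struct_cost = ((d_src[:, None, :] - d_tgt[None, :, :]) ** 2).mean(axis=2)  # (n, m)
    # Rescale both terms to a comparable range before mixing.
    feat_cost = feat_cost / (feat_cost.max() + 1e-8)
    struct_cost = struct_cost / (struct_cost.max() + 1e-8)
    return (1.0 - alpha) * feat_cost + alpha * struct_cost

def sinkhorn(cost, eps=0.05, n_iters=200):
    """Entropic optimal transport with uniform marginals; returns the plan."""
    n, m = cost.shape
    p = np.full(n, 1.0 / n)
    q = np.full(m, 1.0 / m)
    K = np.exp(-cost / eps)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = p / (K @ v)
        v = q / (K.T @ u)
    return u[:, None] * K * v[None, :]

def soft_target_loss(logits, plan, blend):
    """Cross-entropy against a blend of the plan and the network's own prediction."""
    plan_rows = plan / plan.sum(axis=1, keepdims=True)   # per-source distribution
    z = logits - logits.max(axis=1, keepdims=True)
    log_pred = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    pred = np.exp(log_pred)
    # In a real training loop the target would be treated as a constant.
    target = (1.0 - blend) * plan_rows + blend * pred
    return float(-(target * log_pred).sum(axis=1).mean())

# Toy data: the target is a permuted, slightly noisy copy of the source.
rng = np.random.default_rng(0)
n = 6
xyz_src = rng.normal(size=(n, 3))
perm = rng.permutation(n)
xyz_tgt = xyz_src[perm]
feat_src = rng.normal(size=(n, 8))
feat_tgt = feat_src[perm] + 0.1 * rng.normal(size=(n, 8))
anchors_src = np.array([0, 1, 2])
anchors_tgt = np.argsort(perm)[anchors_src]   # same physical points in the target

cost = anchor_linearized_cost(feat_src, feat_tgt, xyz_src, xyz_tgt,
                              anchors_src, anchors_tgt)
plan = sinkhorn(cost)
logits = rng.normal(size=(n, n))              # stand-in for network outputs
loss = soft_target_loss(logits, plan, blend=0.3)
print(plan.shape, loss)
```

In this sketch the anchor correspondences are given by construction. In practice they would have to be estimated, for example from confident feature matches, which is one reason the resulting plan is better treated as a noisy soft target than as a hard label.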