On the Spectral Geometry of Cross-Modal Representations: A Functional Map Diagnostic for Multimodal Alignment

arXiv cs.AI / 4/13/2026

Key Points

  • The paper analyzes cross-modal alignment between independently pretrained vision (DINOv2) and language (all-MiniLM-L6-v2) encoders using the functional map framework over Laplacian eigenbases.
  • It finds that the functional map approach underperforms simpler baselines like Procrustes alignment and relative representations for cross-modal retrieval across supervision budgets.
  • Despite the weaker retrieval results, the two encoders have quantitatively similar Laplacian eigenvalue spectra (normalized spectral distance of 0.043), suggesting comparable intrinsic manifold complexity.
  • However, the functional map shows near-zero diagonal dominance (mean below 0.05) and high orthogonality error (70.15), indicating that the two eigenvector bases are effectively unaligned.
  • The work introduces the “spectral complexity–orientation gap” concept and proposes diagnostic metrics (diagonal dominance, orthogonality deviation, and Laplacian commutativity error) to characterize cross-modal representation compatibility.
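
To make the three diagnostics concrete, here is a minimal NumPy sketch. The summary does not give the paper's exact formulas, so the definitions below are assumptions chosen for illustration: diagonal dominance as the per-row share of squared energy on the diagonal, orthogonality deviation as the Frobenius norm of C^T C - I, and commutativity error as the Frobenius norm of C diag(evals_src) - diag(evals_tgt) C. The function names, the matrix size k, and the toy matrices are likewise hypothetical.

```python
import numpy as np

def diagonal_dominance(C):
    """Per-row fraction of squared energy on the diagonal (assumed definition)."""
    k = min(C.shape)
    row_energy = (C ** 2).sum(axis=1)[:k]
    diag_energy = np.diag(C)[:k] ** 2
    return diag_energy / np.maximum(row_energy, 1e-12)

def orthogonality_deviation(C):
    """Frobenius norm of C^T C - I; zero when C has orthonormal columns (assumed definition)."""
    return float(np.linalg.norm(C.T @ C - np.eye(C.shape[1])))

def commutativity_error(C, evals_src, evals_tgt):
    """Frobenius norm of C diag(evals_src) - diag(evals_tgt) C (assumed definition)."""
    scaled_cols = C * evals_src[None, :]   # same as C @ diag(evals_src)
    scaled_rows = evals_tgt[:, None] * C   # same as diag(evals_tgt) @ C
    return float(np.linalg.norm(scaled_cols - scaled_rows))

# Toy comparison: a perfectly aligned map versus an unstructured one.
rng = np.random.default_rng(0)
k = 20
evals = np.linspace(0.0, 1.0, k)        # hypothetical shared spectrum
C_aligned = np.eye(k)                   # eigenbases match one-to-one
C_scrambled = rng.normal(size=(k, k))   # no preferred pairing of eigenvectors

for name, C in [("aligned", C_aligned), ("scrambled", C_scrambled)]:
    print(
        name,
        diagonal_dominance(C).mean(),
        orthogonality_deviation(C),
        commutativity_error(C, evals, evals),
    )
```

Under these assumed definitions, an identity map scores 1 on diagonal dominance and 0 on the other two quantities, while an unstructured map scores low on diagonal dominance and high on both error terms. The second pattern is qualitatively what the key points describe, even though the spectrum is identical in both toy cases.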

Abstract

We study cross-modal alignment between independently pretrained vision (DINOv2) and language (all-MiniLM-L6-v2) encoders using the functional map framework from computational geometry, which represents correspondence between representation manifolds as a compact linear operator between graph Laplacian eigenbases. While the framework underperforms Procrustes alignment and relative representations for cross-modal retrieval across all supervision budgets, it reveals a structural property of multimodal representations. We find that the Laplacian eigenvalue spectra of the two encoders are quantitatively similar (normalized spectral distance 0.043), indicating that independently trained models develop manifolds of comparable intrinsic complexity. However, the functional map exhibits near-zero diagonal dominance (mean below 0.05) and large orthogonality error (70.15), showing that the eigenvector bases are effectively unaligned. We term this decoupling the spectral complexity–orientation gap: models converge in how much structure they capture but not in how they organize it. This gap defines a boundary condition for spectral alignment methods and motivates three diagnostic quantities for characterizing cross-modal representation compatibility: diagonal dominance, orthogonality deviation, and Laplacian commutativity error.
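
As a rough illustration of the pipeline the abstract describes, the sketch below builds a graph Laplacian eigenbasis for each embedding set, fits a functional map from paired anchor samples by least squares, and compares the two eigenvalue spectra. Every concrete choice here is an assumption rather than the paper's setup: the cosine kNN graph, the symmetric-normalized Laplacian, the indicator-style anchor constraints, the relative L2 form of the spectral distance, and all function names and sizes. Random vectors and a rotated copy stand in for the DINOv2 and all-MiniLM-L6-v2 embeddings.

```python
import numpy as np

def laplacian_eigenbasis(X, n_neighbors=10, k=10):
    """k smallest eigenpairs of a symmetric-normalized cosine-kNN graph Laplacian."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = X @ X.T
    np.fill_diagonal(sim, -np.inf)                 # exclude self-matches
    nbrs = np.argsort(-sim, axis=1)[:, :n_neighbors]
    n = X.shape[0]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), n_neighbors), nbrs.ravel()] = 1.0
    W = np.maximum(W, W.T)                         # symmetrize the kNN graph
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    evals, evecs = np.linalg.eigh(L)               # eigenvalues in ascending order
    return evals[:k], evecs[:, :k]

def fit_functional_map(evecs_src, evecs_tgt, anchors):
    """Least-squares C such that C @ evecs_src[a] is close to evecs_tgt[a] for each anchor a."""
    A = evecs_src[anchors]                         # (m, k) source coefficients
    B = evecs_tgt[anchors]                         # (m, k) target coefficients
    Ct, *_ = np.linalg.lstsq(A, B, rcond=None)     # solves A @ C.T ~ B
    return Ct.T

def normalized_spectral_distance(evals_a, evals_b):
    """Relative L2 gap between two spectra (assumed definition)."""
    return float(np.linalg.norm(evals_a - evals_b) / max(np.linalg.norm(evals_a), 1e-12))

# Toy stand-ins: Y is a rotated copy of X, so row i of X pairs with row i of Y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
Y = X @ Q

evals_x, evecs_x = laplacian_eigenbasis(X)
evals_y, evecs_y = laplacian_eigenbasis(Y)
anchors = rng.choice(200, size=100, replace=False)  # paired samples used as supervision
C = fit_functional_map(evecs_x, evecs_y, anchors)
dist = normalized_spectral_distance(evals_x, evals_y)
print(C.shape, dist)
```

A rotation preserves cosine similarities, so this toy pair is an easy case in which the two graphs essentially coincide. The abstract reports the harder situation for real vision and language encoders: the spectra stay close, but the fitted map is far from diagonal.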