On the Spectral Geometry of Cross-Modal Representations: A Functional Map Diagnostic for Multimodal Alignment

arXiv cs.AI / 4/13/2026

Key Points

The paper analyzes cross-modal alignment between independently pretrained vision (DINOv2) and language (all-MiniLM-L6-v2) encoders using the functional map framework over Laplacian eigenbases.
The functional map approach underperforms simpler baselines (Procrustes alignment and relative representations) on cross-modal retrieval at every supervision budget.
Despite this, the authors find that the two encoders have quantitatively similar Laplacian eigenvalue spectra (normalized spectral distance of 0.043), suggesting comparable intrinsic manifold complexity.
However, the functional map shows near-zero diagonal dominance (mean below 0.05) and a large orthogonality error (70.15), indicating that the two eigenvector bases are effectively unaligned.
The work introduces the “spectral complexity–orientation gap” concept and proposes diagnostic metrics (diagonal dominance, orthogonality deviation, and Laplacian commutativity error) to characterize cross-modal representation compatibility.
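
To make the spectral comparison in the points above concrete, here is a minimal sketch of how two embedding sets could be compared through their graph Laplacian eigenvalues. The summary does not give the paper's graph construction or normalization, so the cosine kNN graph, the symmetric normalized Laplacian, the parameters `k` and `m`, and the helper names `knn_laplacian` and `spectral_distance` are all assumptions for illustration.

```python
import numpy as np

def knn_laplacian(X, k=10):
    """Symmetric normalized Laplacian of a cosine kNN graph (assumed construction)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T                                  # cosine similarities
    np.fill_diagonal(S, -np.inf)                   # never pick a point as its own neighbor
    n = S.shape[0]
    idx = np.argpartition(-S, k, axis=1)[:, :k]    # k most similar points per row
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    W = np.maximum(W, W.T)                         # symmetrize the adjacency
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

def spectral_distance(X, Y, k=10, m=30):
    """Relative L2 gap between the m smallest Laplacian eigenvalues.

    The normalization is an assumption; the paper's exact definition may differ.
    """
    lx = np.linalg.eigvalsh(knn_laplacian(X, k))[:m]
    ly = np.linalg.eigvalsh(knn_laplacian(Y, k))[:m]
    scale = max(np.linalg.norm(lx), np.linalg.norm(ly))
    return np.linalg.norm(lx - ly) / scale
```

Because eigenvalues do not change when an embedding space is rotated, a quantity like this can be small even when the two eigenvector bases point in unrelated directions, which is the situation the key points describe.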

Abstract

We study cross-modal alignment between independently pretrained vision (DINOv2) and language (all-MiniLM-L6-v2) encoders using the functional map framework from computational geometry, which represents correspondence between representation manifolds as a compact linear operator between graph Laplacian eigenbases. While the framework underperforms Procrustes alignment and relative representations for cross-modal retrieval across all supervision budgets, it reveals a structural property of multimodal representations. We find that the Laplacian eigenvalue spectra of the two encoders are quantitatively similar (normalized spectral distance 0.043), indicating that independently trained models develop manifolds of comparable intrinsic complexity. However, the functional map exhibits near-zero diagonal dominance (mean below 0.05) and large orthogonality error (70.15), showing that the eigenvector bases are effectively unaligned. We term this decoupling the spectral complexity–orientation gap: models converge in how much structure they capture but not in how they organize it. This gap defines a boundary condition for spectral alignment methods and motivates three diagnostic quantities (diagonal dominance, orthogonality deviation, and Laplacian commutativity error) for characterizing cross-modal representation compatibility.
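
The abstract names a functional map and three diagnostics without giving formulas, so the following is only a sketch under stated assumptions. It fits a map C between two truncated Laplacian eigenbases by least squares on paired anchor samples, then computes one plausible version of each diagnostic. The function names, the anchor-based fitting, and the exact diagnostic formulas (row-energy ratio, Frobenius norm of CᵀC − I, Frobenius norm of CΛ_X − Λ_Y C) are hypothetical choices, not the paper's confirmed definitions.

```python
import numpy as np

def fit_functional_map(Phi, Psi, anchors):
    """Least-squares C such that C @ Phi[a] ~= Psi[a] for each anchor row a.

    Phi, Psi: (n, m) eigenvector matrices of the two graph Laplacians,
    with row i of each describing the same paired sample (assumption).
    """
    Ct, *_ = np.linalg.lstsq(Phi[anchors], Psi[anchors], rcond=None)
    return Ct.T

def diagnostics(C, lam_x, lam_y):
    """Assumed definitions of the three diagnostic quantities."""
    energy = C ** 2
    # Share of each row's energy that sits on the diagonal, averaged over rows.
    diag_dom = np.mean(np.diag(energy) / (energy.sum(axis=1) + 1e-12))
    # Frobenius distance of C^T C from the identity.
    orth_dev = np.linalg.norm(C.T @ C - np.eye(C.shape[1]))
    # Frobenius norm of C @ diag(lam_x) - diag(lam_y) @ C.
    comm_err = np.linalg.norm(C * lam_x[None, :] - lam_y[:, None] * C)
    return diag_dom, orth_dev, comm_err

# Toy check: mapping a basis onto itself should give C close to the identity,
# so the three values should be close to (1, 0, 0).
rng = np.random.default_rng(0)
M = rng.normal(size=(40, 40))
lam, Phi = np.linalg.eigh(M + M.T)
lam, Phi = lam[:8], Phi[:, :8]
C = fit_functional_map(Phi, Phi, np.arange(40))
print(diagnostics(C, lam, lam))
```

Under these definitions, a C that is nearly a permutation or rotation of the identity would keep orthogonality deviation small while driving diagonal dominance down; the reported combination of low diagonal dominance and large orthogonality error suggests the fitted map is neither diagonal nor close to an isometry.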
