Controlled Paraphrase Geometry in Sentence Embedding Space: Local Manifold Modeling and Latent Probing

arXiv cs.CL / 5/5/2026



Key Points

The paper examines how controlled, paraphrase-like semantic variations are organized locally within sentence embedding space, focusing on the geometry of embedding “clouds.”
It proposes local geometric modeling using fitted low-degree carrier functions (affine, quadratic, and cubic), and introduces a surface-based latent probing method that generates synthetic latent points in a reduced local PCA space.
The generated points are evaluated for fidelity to the fitted surface, preservation of neighborhood structure, agreement with the empirical embedding distribution, and stability of second-order (Hessian) shape descriptors and fitted coefficients.
Experiments indicate that nonlinear local models (quadratic/cubic) capture embedding clouds more accurately than affine models, with surface-based generation achieving strong geometric consistency (including Hessian and coefficient consistency).
However, downstream experiments show that geometric validity of synthetic latent points does not necessarily improve classification performance, underscoring the difference between “geometry-aware validity” and “discriminative utility.” The work also releases CoPaGE-300K, a controlled template-based dataset of semantically close sentence variants.
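To make the affine-versus-quadratic comparison in the key points concrete, here is a minimal sketch of local carrier fitting. It is not the paper's implementation: it assumes a "carrier" is a polynomial that predicts the last reduced PCA coordinate from the leading ones, it uses a synthetic toy cloud instead of real sentence embeddings, and the helper names `local_pca`, `design_matrix`, and `fit_carrier` are hypothetical.

```python
import numpy as np

def local_pca(X, k):
    """Center a cloud X of shape (n, d) and project it onto its top-k principal directions."""
    centered = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:k].T

def design_matrix(U, degree):
    """Monomials of the columns of U up to total degree 1 (affine) or 2 (quadratic)."""
    n, m = U.shape
    cols = [np.ones(n)] + [U[:, i] for i in range(m)]
    if degree >= 2:
        cols += [U[:, i] * U[:, j] for i in range(m) for j in range(i, m)]
    return np.column_stack(cols)

def fit_carrier(Z, degree):
    """Least-squares fit of the last reduced coordinate as a polynomial in the others.

    Assumption: this graph-style parameterization stands in for the paper's carriers.
    """
    U, h = Z[:, :-1], Z[:, -1]
    A = design_matrix(U, degree)
    coef, *_ = np.linalg.lstsq(A, h, rcond=None)
    rmse = float(np.sqrt(np.mean((A @ coef - h) ** 2)))
    return coef, rmse

# Toy stand-in for one embedding cloud: a gently curved 2-D sheet embedded in 16 dimensions.
rng = np.random.default_rng(0)
U = rng.normal(size=(200, 2))
h = 0.2 * U[:, 0] ** 2 - 0.1 * U[:, 0] * U[:, 1] + 0.01 * rng.normal(size=200)
Q, _ = np.linalg.qr(rng.normal(size=(16, 3)))  # 16 x 3 matrix with orthonormal columns
X = np.column_stack([U, h]) @ Q.T

Z = local_pca(X, k=3)
coef_aff, rmse_aff = fit_carrier(Z, degree=1)
coef_quad, rmse_quad = fit_carrier(Z, degree=2)
print(f"affine RMSE = {rmse_aff:.4f}, quadratic RMSE = {rmse_quad:.4f}")
```

Because the quadratic model contains the affine one as a special case, its least-squares residual can never be larger on the same points. A fair comparison on real embedding clouds would therefore use held-out points or a complexity penalty, which this sketch omits.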

Abstract

The paper studies the local geometry of embedding clouds induced by controlled local classes of semantically close sentences. The central question is how controlled paraphrase-like semantic variation is organized in sentence embedding space and whether this local structure can be explicitly modeled by low-degree fitted carriers. We introduce a local geometric modeling scheme based on affine, quadratic, and cubic fitted models. We also use a surface-based latent probing procedure that constructs synthetic latent points in a reduced local PCA space with respect to the fitted carrier. The procedure is intended as an offline method for representation-space analysis, local manifold modeling, and geometry-aware latent probing. Generated latent points are evaluated using criteria that measure consistency with the fitted surface, preservation of neighborhood structure, agreement with the empirical distribution, stability of Hessian-based second-order shape descriptors, and stability of fitted-model coefficients. Experiments on controlled sets of semantically close sentences show that nonlinear local models describe embedding clouds more accurately than affine models. Surface-based generation provides strong fitted-geometry fidelity, including surface consistency, Hessian-based shape consistency, and coefficient consistency. Downstream experiments show that geometric validity of synthetic latent points does not automatically translate into improved classification performance. The results support explicit local geometric modeling of sentence embedding space and highlight the need to distinguish geometric validity from discriminative utility. As a resource contribution, we introduce CoPaGE-300K, a controlled template-based dataset of semantically close sentence variants with slot-level annotations and precomputed sentence embeddings.
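The surface-based probing and Hessian checks described in the abstract can be illustrated with a small sketch. Everything here is an assumption made for illustration, not the paper's procedure: the cloud is taken to be already reduced to three local PCA coordinates, the carrier is a quadratic in the first two, new base coordinates are drawn as convex mixes of observed pairs, and the names `quad_features`, `fit_quadratic`, `hessian`, and `probe_surface` are hypothetical.

```python
import numpy as np

def quad_features(U):
    """Quadratic monomials [1, u, v, u^2, uv, v^2] for base coordinates U of shape (n, 2)."""
    u, v = U[:, 0], U[:, 1]
    return np.column_stack([np.ones(len(U)), u, v, u * u, u * v, v * v])

def fit_quadratic(Z):
    """Least-squares quadratic carrier: third coordinate as a function of the first two."""
    coef, *_ = np.linalg.lstsq(quad_features(Z[:, :2]), Z[:, 2], rcond=None)
    return coef

def hessian(coef):
    """Hessian of h = c0 + c1*u + c2*v + c3*u^2 + c4*u*v + c5*v^2 (constant for a quadratic)."""
    return np.array([[2 * coef[3], coef[4]],
                     [coef[4], 2 * coef[5]]])

def probe_surface(Z, coef, n_new, rng):
    """Draw base coordinates between random observed pairs, then lift them onto the surface."""
    i = rng.integers(0, len(Z), size=n_new)
    j = rng.integers(0, len(Z), size=n_new)
    t = rng.uniform(size=(n_new, 1))
    U_new = t * Z[i, :2] + (1 - t) * Z[j, :2]
    h_new = quad_features(U_new) @ coef
    return np.column_stack([U_new, h_new])

# Toy reduced cloud (assumed to be the output of a local PCA step).
rng = np.random.default_rng(1)
U = rng.normal(size=(300, 2))
h = (0.3 * U[:, 0] ** 2 + 0.1 * U[:, 0] * U[:, 1] - 0.2 * U[:, 1] ** 2
     + 0.02 * rng.normal(size=300))
Z = np.column_stack([U, h])

coef = fit_quadratic(Z)
Z_syn = probe_surface(Z, coef, n_new=500, rng=rng)

# Surface consistency: distance of synthetic points from the fitted carrier.
surface_residual = float(np.max(np.abs(quad_features(Z_syn[:, :2]) @ coef - Z_syn[:, 2])))
# Hessian consistency: refit on the synthetic points and compare second-order shape.
hessian_gap = float(np.linalg.norm(hessian(coef) - hessian(fit_quadratic(Z_syn))))
print(f"surface residual = {surface_residual:.2e}, Hessian gap = {hessian_gap:.2e}")
```

In this sketch the synthetic points lie exactly on the fitted surface, so surface, Hessian, and coefficient consistency hold almost by construction. That is one way to read the abstract's distinction: such checks measure fidelity to the fitted geometry, while neighborhood preservation, agreement with the empirical distribution, and downstream classification are separate questions that this toy example does not address.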