Semantic Shift: the Fundamental Challenge in Text Embedding and Retrieval

arXiv cs.CL / 3/24/2026

Opinion | Ideas & Deep Analysis | Models & Research

Key Points

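The study addresses geometric pathologies in Transformer-based text embeddings (anisotropy and length-induced embedding collapse). Beyond existing accounts of what these pathologies look like, it proposes "semantic shift" as the causal factor that explains when and why they harm retrieval accuracy.
A theoretical analysis of semantic smoothing shows that as semantic diversity within a set of sentences increases, the pooled representation moves away from each sentence's embedding, becoming smoothed and less discriminative.
Semantic shift is formalized as a computable measure that integrates local semantic evolution and global semantic dispersion. Controlled experiments across multiple embedding models and corpora show that semantic shift aligns closely with the degree of embedding concentration and predicts retrieval degradation.
Text length alone cannot fully explain the degradation, whereas semantic shift makes it possible to diagnose the conditions under which anisotropy becomes harmful, offering a unified and practical perspective.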
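The summary does not give the paper's actual formula for semantic shift, so the following is only an illustrative proxy under stated assumptions. It treats "local evolution" as the mean cosine distance between consecutive sentence embeddings and "global dispersion" as the mean cosine distance of each sentence from the centroid, then blends them with a hypothetical weight `alpha`. The function names, the weighting scheme, and the toy vectors are all assumptions made for illustration, not the authors' definition.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def local_evolution(sentence_vecs):
    """Assumed local term: mean distance between consecutive sentences."""
    steps = [cosine_distance(a, b) for a, b in zip(sentence_vecs, sentence_vecs[1:])]
    return sum(steps) / len(steps) if steps else 0.0


def global_dispersion(sentence_vecs):
    """Assumed global term: mean distance of each sentence from the centroid."""
    n = len(sentence_vecs)
    dim = len(sentence_vecs[0])
    centroid = [sum(v[i] for v in sentence_vecs) / n for i in range(dim)]
    return sum(cosine_distance(v, centroid) for v in sentence_vecs) / n


def semantic_shift_proxy(sentence_vecs, alpha=0.5):
    """Hypothetical blend of the two terms; alpha is an illustrative weight."""
    return alpha * local_evolution(sentence_vecs) + (1 - alpha) * global_dispersion(sentence_vecs)


# Toy 2-D "sentence embeddings": one text stays on topic, one drifts.
focused = [[1.0, 0.0], [0.99, 0.1], [0.98, 0.15]]
drifting = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.2]]

print("focused :", semantic_shift_proxy(focused))
print("drifting:", semantic_shift_proxy(drifting))
```

In practice the sentence vectors would come from an embedding model rather than hand-written lists. The two toy texts have the same number of sentences, which mirrors the summary's point that length alone does not separate them while a shift-style score does.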

Abstract

Transformer-based embedding models rely on pooling to map variable-length text into a single vector, enabling efficient similarity search but also inducing well-known geometric pathologies such as anisotropy and length-induced embedding collapse. Existing accounts largely describe what these pathologies look like, yet provide limited insight into when and why they harm downstream retrieval. In this work, we argue that the missing causal factor is semantic shift: the intrinsic, structured evolution and dispersion of semantics within a text. We first present a theoretical analysis of semantic smoothing in Transformer embeddings: as the semantic diversity among constituent sentences increases, the pooled representation necessarily shifts away from every individual sentence embedding, yielding a smoothed and less discriminative vector. Building on this foundation, we formalize semantic shift as a computable measure integrating local semantic evolution and global semantic dispersion. Through controlled experiments across corpora and multiple embedding models, we show that semantic shift aligns closely with the severity of embedding concentration and predicts retrieval degradation, whereas text length alone does not. Overall, semantic shift offers a unified and actionable lens for understanding embedding collapse and for diagnosing when anisotropy becomes harmful.
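The smoothing effect described in the abstract can be illustrated with a small simulation. This is a toy sketch, not the paper's analysis: it assumes mean pooling over unit-length sentence vectors, generates synthetic "sentences" as Gaussian perturbations of a shared topic direction, and uses a hypothetical `spread` parameter as a stand-in for semantic diversity. For unit vectors, the average cosine similarity between the mean-pooled vector and its constituents equals the norm of the pooled vector, so a falling value means the pooled vector is both shorter and further from its sentences.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pool(vectors):
    """Element-wise average, standing in for the model's pooling step."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def toy_sentences(n, dim, spread, rng):
    """Synthetic unit vectors around a random topic; larger spread = more diverse."""
    topic = normalize([rng.gauss(0, 1) for _ in range(dim)])
    return [normalize([t + spread * rng.gauss(0, 1) for t in topic]) for _ in range(n)]


def mean_similarity_to_pooled(sentences):
    """Average cosine similarity between the pooled vector and each sentence."""
    pooled = mean_pool(sentences)
    return sum(cosine(pooled, s) for s in sentences) / len(sentences)


rng = random.Random(0)
for spread in (0.05, 0.2, 0.5):
    sents = toy_sentences(n=16, dim=64, spread=spread, rng=rng)
    print(f"spread={spread}: mean cos(pooled, sentence) = {mean_similarity_to_pooled(sents):.3f}")
```

The sentence count is held fixed at 16 in every run, so any drop in similarity comes from the diversity of the sentences rather than from the text getting longer.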
