The Geometric Alignment Tax: Tokenization vs. Continuous Geometry in Scientific Foundation Models

arXiv cs.LG / 4/7/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that scientific foundation models (biology/physics) often fail to preserve the underlying continuous geometry because of an intrinsic “Geometric Alignment Tax” created when continuous manifolds are forced through discrete categorical bottlenecks (tokenization/quantization).
  • In controlled synthetic experiments, keeping the encoder fixed while replacing the cross-entropy head with a continuous one reduces geometric distortion by up to 8.5×, but learned codebooks exhibit a non-monotonic double bind in which finer quantization can worsen geometry even as reconstruction improves (see the head-swap sketch just after this list).
  • Across three architectures, continuous objectives yield geometric distortion within about 1.3× of one another, whereas under discrete tokenization the same architectures diverge by about 3,000×, implying that tokenization strongly amplifies geometric misalignment.
  • Using rate-distortion theory and MINE on 14 biological foundation models, the authors identify three failure regimes—Local-Global Decoupling, Representational Compression, and Geometric Vacuity—where geometry, mutual information, and global coherence cannot be optimized together.
  • A DNA-focused experiment suggests Evo 2's reverse-complement robustness stems from conserved sequence composition rather than a genuinely learned symmetry, underscoring the limits of how token-based representations encode structure (the composition check below makes the confound concrete).
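The head-swap ablation is easy to picture in code. Below is a minimal PyTorch sketch, not the paper's implementation: the MLP sizes, the 256-bin discretization, and the random toy batch are all illustrative assumptions. What the paper does specify is the design it illustrates, namely that the encoder is held identical while only the objective changes.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Shared encoder: identical architecture and inputs for both variants."""
    def __init__(self, in_dim=2, hidden=128, latent=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, latent),
        )

    def forward(self, x):
        return self.net(x)

# Discrete variant: a categorical head over quantized next-state bins,
# trained with cross-entropy (the "tokenized" bottleneck).
categorical_head = nn.Linear(32, 256)

# Continuous variant: regress the next state directly, trained with MSE.
continuous_head = nn.Linear(32, 2)

enc = Encoder()
x = torch.randn(64, 2)                  # current states of a synthetic system
y_next = torch.randn(64, 2)             # true next states (continuous)
y_bins = torch.randint(0, 256, (64,))   # the same targets, discretized

z = enc(x)
ce_loss = nn.functional.cross_entropy(categorical_head(z), y_bins)
mse_loss = nn.functional.mse_loss(continuous_head(z), y_next)
```

Training one copy of the encoder under each loss, then comparing how well pairwise distances in `z` track distances on the true manifold, isolates the objective's contribution to the geometric distortion the paper measures.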
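The Evo 2 confound is also simple to demonstrate. Reverse complementation swaps A with T and C with G, so any strand whose A/T and C/G counts are roughly balanced (as real genomic DNA tends to be, per Chargaff's second parity rule) keeps essentially the same base composition. A model that responds mainly to composition would therefore look reverse-complement robust without encoding the symmetry at all. A self-contained illustration, not the paper's protocol:

```python
from collections import Counter

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq: str) -> str:
    """Reverse the strand and replace each base with its Watson-Crick complement."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

seq = "ATGCATCGGCAT"   # toy strand with balanced A/T and C/G counts
rc = reverse_complement(seq)

# Because complementation swaps A<->T and C<->G, a strand with A=T and C=G
# counts has *identical* single-base composition to its reverse complement,
# so a purely composition-driven model sees no difference between the two.
print(rc)                           # the reversed, complemented strand
print(Counter(seq) == Counter(rc))  # True
```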

Abstract

Foundation models for biology and physics optimize predictive accuracy, but their internal representations systematically fail to preserve the continuous geometry of the systems they model. We identify the root cause: the Geometric Alignment Tax, an intrinsic cost of forcing continuous manifolds through discrete categorical bottlenecks. Controlled ablations on synthetic dynamical systems demonstrate that replacing cross-entropy with a continuous head on an identical encoder reduces geometric distortion by up to 8.5×, while learned codebooks exhibit a non-monotonic double bind where finer quantization worsens geometry despite improving reconstruction. Under continuous objectives, three architectures differ by 1.3×; under discrete tokenization, they diverge by 3,000×. Evaluating 14 biological foundation models with rate-distortion theory and MINE, we identify three failure regimes: Local-Global Decoupling, Representational Compression, and Geometric Vacuity. A controlled experiment confirms that Evo 2's reverse-complement robustness on real DNA reflects conserved sequence composition, not learned symmetry. No model achieves simultaneously low distortion, high mutual information, and global coherence.
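MINE (Mutual Information Neural Estimation; Belghazi et al., 2018) deserves a quick gloss, since it is how the authors quantify what the representations retain. It trains a small "statistics network" T(x, z) to maximize the Donsker-Varadhan lower bound I(X; Z) ≥ E_joint[T] − log E_marginal[exp(T)], where marginal samples come from breaking the (x, z) pairing. The sketch below is illustrative only: the network size, training loop, and toy data are assumptions, not the authors' configuration.

```python
import math
import torch
import torch.nn as nn

class StatisticsNetwork(nn.Module):
    """T(x, z): a small critic scoring how 'paired' an (x, z) sample looks."""
    def __init__(self, x_dim, z_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1)).squeeze(-1)

def dv_lower_bound(T, x, z):
    """Donsker-Varadhan bound: E_joint[T] - log E_marginal[exp(T)]."""
    joint = T(x, z).mean()
    z_shuffled = z[torch.randperm(z.size(0))]  # break pairing => marginal samples
    marginal = torch.logsumexp(T(x, z_shuffled), dim=0) - math.log(z.size(0))
    return joint - marginal

# Toy usage: z is a noisy linear "representation" of x, so I(X; Z) > 0.
x = torch.randn(512, 8)
z = x @ torch.randn(8, 16) + 0.1 * torch.randn(512, 16)

T = StatisticsNetwork(x_dim=8, z_dim=16)
opt = torch.optim.Adam(T.parameters(), lr=1e-3)
for _ in range(300):
    opt.zero_grad()
    (-dv_lower_bound(T, x, z)).backward()  # ascend the bound
    # (the original MINE paper corrects the biased gradient of the
    # log-mean-exp term with a moving average; omitted here for brevity)
    opt.step()

print(dv_lower_bound(T, x, z).item())      # MI estimate in nats
```

For the paper's use case, x would play the role of the ground-truth system state and z the foundation model's embedding, with the bound probing how much state information survives the representational bottleneck.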