Convergent Evolution: How Different Language Models Learn Similar Number Representations

arXiv cs.CL · April 23, 2026


Key Points

  • The paper finds that many text-trained language models represent numbers using periodic features with dominant periods at T=2, 5, and 10.
  • It identifies a two-tier hierarchy: while many model families learn Fourier-domain “period-T” spike features, only some also learn geometrically separable features that enable linear classification of numbers mod-T (both properties are probed in the sketch after this list).
  • The authors prove that Fourier-domain sparsity is necessary but not sufficient for achieving mod-T geometric separability, explaining why models converge on similar periodic signals yet differ in classification structure.
  • Experiments show that data, architecture, optimizer, and tokenizer jointly determine whether geometrically separable features emerge, with two main training routes to acquire them: (1) complementary co-occurrence signals in language data and (2) multi-token addition-style problems.
  • Overall, the work frames this as “convergent evolution” in feature learning, where different model types arrive at similar underlying number representations via different signals.
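
To make the first two points concrete, here is a minimal Python sketch of how both properties might be checked on a number-embedding matrix: a period-T spike shows up as concentrated FFT energy along the integer axis, while geometric separability is what a linear probe measures. The function names and the synthetic cos/sin embeddings are illustrative assumptions, not the paper's code; in practice the embeddings would be extracted from a trained model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def period_spike_energy(emb, periods=(2, 5, 10)):
    """Fraction of non-DC spectral energy at each candidate period.

    emb: (N, d) array whose row n is the embedding of the integer n.
    A period-T feature concentrates FFT energy (taken along the
    integer axis) at frequency index N / T, i.e. N/T cycles per N rows.
    """
    N = emb.shape[0]
    spectrum = np.abs(np.fft.rfft(emb - emb.mean(axis=0), axis=0)) ** 2
    total = spectrum[1:].sum()  # skip the DC row
    return {T: float(spectrum[round(N / T)].sum() / total) for T in periods}

def mod_probe_accuracy(emb, T):
    """Training accuracy of a linear probe classifying n mod T.

    Near-perfect accuracy indicates geometrically separable mod-T
    features; a Fourier spike alone does not guarantee it.
    """
    labels = np.arange(emb.shape[0]) % T
    probe = LogisticRegression(max_iter=1000).fit(emb, labels)
    return probe.score(emb, labels)

# Synthetic stand-in for model embeddings: a clean period-5 circle
# (cos, sin pair) plus irrelevant noise dimensions.
rng = np.random.default_rng(0)
n = np.arange(200)
circle = np.stack([np.cos(2 * np.pi * n / 5),
                   np.sin(2 * np.pi * n / 5)], axis=1)
emb = np.hstack([circle, 0.1 * rng.standard_normal((200, 30))])

print(period_spike_energy(emb))    # energy concentrates at T=5
print(mod_probe_accuracy(emb, 5))  # ~1.0: linearly separable mod 5
```

The synthetic circle embedding passes both checks by construction; the paper's point is that real models can pass the spectral check while failing the probe.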

Abstract

Language models trained on natural text learn to represent numbers using periodic features with dominant periods at T=2, 5, and 10. In this paper, we identify a two-tiered hierarchy of these features: while Transformers, Linear RNNs, LSTMs, and classical word embeddings trained in different ways all learn features that have period-T spikes in the Fourier domain, only some learn geometrically separable features that can be used to linearly classify a number mod-T. To explain this incongruity, we prove that Fourier-domain sparsity is necessary but not sufficient for mod-T geometric separability. Empirically, we investigate when model training yields geometrically separable features, finding that the data, architecture, optimizer, and tokenizer all play key roles. In particular, we identify two different routes through which models can acquire geometrically separable features: they can learn them from complementary co-occurrence signals in general language data, including text-number co-occurrence and cross-number interaction, or from multi-token (but not single-token) addition problems. Overall, our results highlight the phenomenon of convergent evolution in feature learning: A diverse range of models learn similar features from different training signals.
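
The necessary-but-not-sufficient result can be illustrated with a toy example (an assumption of this write-up, not the paper's proof): a one-dimensional feature whose spectrum is a single period-5 spike can still collapse distinct residues, because cosine is even; the quadrature sine component is what makes the mod-5 classes linearly separable.

```python
import numpy as np

# A 1-D feature with all of its spectral energy at period 5 that
# still cannot be classified mod 5.
n = np.arange(200)
f = np.cos(2 * np.pi * n / 5)

spectrum = np.abs(np.fft.rfft(f)) ** 2
print(int(np.argmax(spectrum)))  # 40: 40 cycles / 200 points = period 5

# cos is even, so residues 1 and 4 (and 2 and 3) land on the same
# value; no classifier, linear or otherwise, can tell them apart
# from f alone.
print(np.isclose(f[1], f[4]), np.isclose(f[2], f[3]))  # True True

# Adding the quadrature component sin(2*pi*n/5) breaks the tie,
# mapping each residue class to a distinct vertex of a regular
# pentagon, which a linear probe can separate.
```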