Convergent Evolution: How Different Language Models Learn Similar Number Representations

arXiv cs.CL · April 23, 2026


Key Points

  • The paper finds that many text-trained language models represent numbers using periodic features with dominant periods at T=2, 5, and 10.
  • It identifies a two-tier hierarchy: while many model families learn Fourier-domain “period-T” spike features, only some also learn geometrically separable features that enable linear classification of numbers mod-T (both properties are probed in the sketch after this list).
  • The authors prove that Fourier-domain sparsity is necessary but not sufficient for achieving mod-T geometric separability, explaining why models converge on similar periodic signals yet differ in classification structure.
  • Experiments show that data, architecture, optimizer, and tokenizer jointly determine whether geometrically separable features emerge, with two main training routes to acquire them: (1) complementary co-occurrence signals in language data and (2) multi-token addition-style problems.
  • Overall, the work frames this as “convergent evolution” in feature learning, where different model types arrive at similar underlying number representations via different signals.
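
To make the first two points concrete, here is a minimal Python sketch of how both properties might be checked on a number-embedding matrix: a period-T spike shows up as concentrated FFT energy along the integer axis, while geometric separability is what a linear probe measures. The function names and the synthetic cos/sin embeddings are illustrative assumptions, not the paper's code; in practice the embeddings would be extracted from a trained model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def period_spike_energy(emb, periods=(2, 5, 10)):
    """Fraction of non-DC spectral energy at each candidate period.

    emb: (N, d) array whose row n is the embedding of the integer n.
    A period-T feature concentrates FFT energy (taken along the
    integer axis) at frequency index N / T, i.e. N/T cycles per N rows.
    """
    N = emb.shape[0]
    spectrum = np.abs(np.fft.rfft(emb - emb.mean(axis=0), axis=0)) ** 2
    total = spectrum[1:].sum()  # skip the DC row
    return {T: float(spectrum[round(N / T)].sum() / total) for T in periods}

def mod_probe_accuracy(emb, T):
    """Training accuracy of a linear probe classifying n mod T.

    Near-perfect accuracy indicates geometrically separable mod-T
    features; a Fourier spike alone does not guarantee it.
    """
    labels = np.arange(emb.shape[0]) % T
    probe = LogisticRegression(max_iter=1000).fit(emb, labels)
    return probe.score(emb, labels)

# Synthetic stand-in for model embeddings: a clean period-5 circle
# (cos, sin pair) plus irrelevant noise dimensions.
rng = np.random.default_rng(0)
n = np.arange(200)
circle = np.stack([np.cos(2 * np.pi * n / 5),
                   np.sin(2 * np.pi * n / 5)], axis=1)
emb = np.hstack([circle, 0.1 * rng.standard_normal((200, 30))])

print(period_spike_energy(emb))    # energy concentrates at T=5
print(mod_probe_accuracy(emb, 5))  # ~1.0: linearly separable mod 5
```

The synthetic circle embedding passes both checks by construction; the paper's point is that real models can pass the spectral check while failing the probe.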

Abstract

Language models trained on natural text learn to represent numbers using periodic features with dominant periods at T=2, 5, and 10. In this paper, we identify a two-tiered hierarchy of these features: while Transformers, Linear RNNs, LSTMs, and classical word embeddings trained in different ways all learn features that have period-T spikes in the Fourier domain, only some learn geometrically separable features that can be used to linearly classify a number mod-T. To explain this incongruity, we prove that Fourier-domain sparsity is necessary but not sufficient for mod-T geometric separability. Empirically, we investigate when model training yields geometrically separable features, finding that the data, architecture, optimizer, and tokenizer all play key roles. In particular, we identify two different routes through which models can acquire geometrically separable features: they can learn them from complementary co-occurrence signals in general language data, including text-number co-occurrence and cross-number interaction, or from multi-token (but not single-token) addition problems. Overall, our results highlight the phenomenon of convergent evolution in feature learning: A diverse range of models learn similar features from different training signals.
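
The necessary-but-not-sufficient result can be illustrated with a toy example (an assumption of this write-up, not the paper's proof): a one-dimensional feature whose spectrum is a single period-5 spike can still collapse distinct residues, because cosine is even; the quadrature sine component is what makes the mod-5 classes linearly separable.

```python
import numpy as np

# A 1-D feature with all of its spectral energy at period 5 that
# still cannot be classified mod 5.
n = np.arange(200)
f = np.cos(2 * np.pi * n / 5)

spectrum = np.abs(np.fft.rfft(f)) ** 2
print(int(np.argmax(spectrum)))  # 40: 40 cycles / 200 points = period 5

# cos is even, so residues 1 and 4 (and 2 and 3) land on the same
# value; no classifier, linear or otherwise, can tell them apart
# from f alone.
print(np.isclose(f[1], f[4]), np.isclose(f[2], f[3]))  # True True

# Adding the quadrature component sin(2*pi*n/5) breaks the tie,
# mapping each residue class to a distinct vertex of a regular
# pentagon, which a linear probe can separate.
```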