Founder effects shape the evolutionary dynamics of multimodality in open LLM families

arXiv cs.AI · March 25, 2026


Key Points

  • The study examines how multimodal (vision-language) capabilities emerge over time in open LLM families, using the ModelBiome AI Ecosystem dataset of Hugging Face model metadata and recorded lineage (more than 1.8M model entries).
  • Cross-modal tasks are widespread in the broader ecosystem well before they become common within major open LLM families, where multimodality remains rare through 2023 and most of 2024 before rising sharply in 2024–2025.
  • Across families, vision-language model (VLM) variants typically debut months after a family's first text-generation releases, with observed lags ranging from about 1 month (Gemma) to over a year for several families and ~26 months for GLM (see the release-lag sketch after this list).
  • Lineage analysis finds weak transfer from text-generation parents to VLM descendants: only 0.218% of fine-tuning edges from text-generation parents lead to VLMs, while most multimodal expansion occurs within existing VLM lineages, with 94.5% of VLM-child edges originating from VLM parents (see the transition-rate sketch below).
  • Many VLM releases appear as “new roots” without recorded parents (~60%), and founder concentration patterns suggest punctuated adoption: rare founder events seed multimodality, followed by rapid within-lineage amplification and diversification.
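
As a concrete illustration of the release-lag measurement, here is a minimal sketch. It assumes a hypothetical `models` table with `family`, `task`, and `created_at` columns; the file name, column names, and task labels are illustrative placeholders, not the actual ModelBiome schema.

```python
import pandas as pd

# Hypothetical export of the per-model metadata; schema assumed, not from the paper.
models = pd.read_parquet("modelbiome_models.parquet")
models["created_at"] = pd.to_datetime(models["created_at"])

# Illustrative set of tasks treated as vision-language; the real taxonomy may differ.
vlm_tasks = {"image-text-to-text", "visual-question-answering"}

# Earliest text-generation and earliest VLM release per family.
first_text = (
    models[models["task"] == "text-generation"]
    .groupby("family")["created_at"].min()
)
first_vlm = (
    models[models["task"].isin(vlm_tasks)]
    .groupby("family")["created_at"].min()
)

# Lag in months from first text-generation release to first VLM release
# (the paper reports ~1 month for Gemma and ~26 months for GLM).
lag_months = ((first_vlm - first_text).dt.days / 30.44).round(1)
print(lag_months.dropna().sort_values())
```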

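The lineage-conditioned transition rates can be sketched the same way, here over a hypothetical fine-tuning edge table with `parent_task` and `child_task` columns (again illustrative names and file paths, not the real schema):

```python
import pandas as pd

# Hypothetical parent-to-child edge list; schema assumed for illustration.
edges = pd.read_parquet("modelbiome_edges.parquet")
ft = edges[edges["edge_type"] == "finetune"]

def is_vlm(task: str) -> bool:
    # Treat image-text tasks as vision-language; the real taxonomy may differ.
    return task in {"image-text-to-text", "visual-question-answering"}

# Share of fine-tuning edges from text-generation parents that yield VLM
# children (reported as 0.218% in the paper).
text_parents = ft[ft["parent_task"] == "text-generation"]
text_to_vlm_rate = text_parents["child_task"].map(is_vlm).mean()

# Composition of parents among VLM-child edges (94.5% VLM vs 4.7%
# text-generation in the paper).
vlm_children = ft[ft["child_task"].map(is_vlm)]
parent_mix = (
    vlm_children["parent_task"]
    .map(lambda t: "vlm" if is_vlm(t) else t)
    .value_counts(normalize=True)
)

print(f"text -> VLM transfer rate: {text_to_vlm_rate:.3%}")
print(parent_mix)
```
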
Abstract

Large language model (LLM) families are improving rapidly, yet it remains unclear how quickly multimodal capabilities emerge and propagate within open families. Using the ModelBiome AI Ecosystem dataset of Hugging Face model metadata and recorded lineage fields (>1.8 × 10^6 model entries), we quantify multimodality over time and along recorded parent-to-child relations. Cross-modal tasks are widespread in the broader ecosystem well before they become common within major open LLM families: within these families, multimodality remains rare through 2023 and most of 2024, then increases sharply in 2024–2025 and is dominated by image-text vision-language tasks. Across major families, the first vision-language model (VLM) variants typically appear months after the first text-generation releases, with lags ranging from ~1 month (Gemma) to more than a year for several families and ~26 months for GLM. Lineage-conditioned transition rates show weak cross-type transfer: among fine-tuning edges from text-generation parents, only 0.218% yield VLM descendants. Instead, multimodality expands primarily within existing VLM lineages: 94.5% of VLM-child fine-tuning edges originate from VLM parents, versus 4.7% from text-generation parents. At the model level, most VLM releases appear as new roots without recorded parents (~60%), while the remainder are predominantly VLM-derived; founder concentration analyses indicate rapid within-lineage amplification followed by diversification. Together, these results show that multimodality enters open LLM families through rare founder events and then expands rapidly within their descendant lineages, producing punctuated adoption dynamics that likely induce distinct, transfer-limited scaling behavior for multimodal capabilities.
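
The model-level claims above (new-root share and founder concentration) reduce to a parent-pointer traversal over the lineage graph. A minimal sketch, assuming the same hypothetical `models` table with a nullable `parent_id` column:

```python
import pandas as pd

# Hypothetical per-model table; column names assumed for illustration.
models = pd.read_parquet("modelbiome_models.parquet")
vlm_tasks = {"image-text-to-text", "visual-question-answering"}
vlms = models[models["task"].isin(vlm_tasks)]

# Share of VLM releases with no recorded parent (~60% in the paper).
new_root_share = vlms["parent_id"].isna().mean()

# Founder concentration: walk each VLM up to its lineage root, then measure
# how many VLMs the most prolific founders account for.
parent_of = models.set_index("model_id")["parent_id"].to_dict()

def root_of(model_id: str) -> str:
    seen = set()
    while True:
        parent = parent_of.get(model_id)
        # Stop at a missing/unrecorded parent, or guard against cycles.
        if parent is None or pd.isna(parent) or parent in seen:
            return model_id
        seen.add(model_id)
        model_id = parent

founder_counts = vlms["model_id"].map(root_of).value_counts()
top10_share = founder_counts.head(10).sum() / founder_counts.sum()
print(f"new-root share: {new_root_share:.1%}")
print(f"top-10 founder share of VLMs: {top10_share:.1%}")
```

A heavily skewed `founder_counts` distribution is what the punctuated-adoption reading predicts: a handful of founder models accounting for most within-lineage VLM descendants.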