A Taxonomy of Programming Languages for Code Generation

arXiv cs.CL / 4/3/2026

Key Points

  • The paper proposes the first reproducible taxonomy of programming languages for code generation by classifying 646 languages into four resource tiers.
  • The resource distribution is highly skewed: Tier 3 (High) contains only 1.9% of the languages yet accounts for 74.6% of all code tokens across seven major corpora.
  • Conversely, Tier 0 (Scarce) contains 71.7% of the languages but contributes just 1.0% of the tokens.
  • Statistical analyses of within-tier inequality, dispersion, and distributional skew confirm that the imbalance is both extreme and systematic.
  • The authors present the taxonomy as a principled framework for dataset curation and for tier-aware evaluation of multilingual LLMs on code generation.
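
The language-share and token-share figures in the points above can be reproduced from per-language token counts plus a tier label for each language. The following is a minimal sketch of that bookkeeping only. The function name `tier_shares`, the placeholder language names, and the toy counts are all hypothetical, and the summary does not state how the paper assigns languages to tiers, so the tier labels are taken as given.

```python
from collections import defaultdict


def tier_shares(token_counts, tier_of):
    """Return {tier: (share_of_languages, share_of_tokens)} as fractions.

    token_counts: {language: total tokens across all corpora}
    tier_of:      {language: tier label}, assumed to be assigned elsewhere
    """
    total_langs = len(token_counts)
    total_tokens = sum(token_counts.values())
    langs_per_tier = defaultdict(int)
    tokens_per_tier = defaultdict(int)
    for lang, count in token_counts.items():
        tier = tier_of[lang]
        langs_per_tier[tier] += 1
        tokens_per_tier[tier] += count
    return {
        tier: (langs_per_tier[tier] / total_langs,
               tokens_per_tier[tier] / total_tokens)
        for tier in sorted(langs_per_tier)
    }


# Toy, made-up numbers for illustration only (not the paper's data).
toy_counts = {"lang_a": 900, "lang_b": 60, "lang_c": 30, "lang_d": 10}
toy_tiers = {"lang_a": 3, "lang_b": 2, "lang_c": 1, "lang_d": 0}

shares = tier_shares(toy_counts, toy_tiers)
for tier, (lang_share, token_share) in shares.items():
    print(f"Tier {tier}: {lang_share:.1%} of languages, {token_share:.1%} of tokens")
```

In this toy input a single language out of four holds most of the tokens, which mirrors the kind of imbalance the paper reports at a much larger scale.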

Abstract

The world's 7,000+ languages vary widely in the availability of resources for NLP, motivating efforts to systematically categorize them by their degree of resourcefulness (Joshi et al., 2020). A similar disparity exists among programming languages (PLs); however, no resource-tier taxonomy has been established for code. As large language models (LLMs) grow increasingly capable of generating code, such a taxonomy becomes essential. To fill this gap, we present the first reproducible PL resource classification, grouping 646 languages into four tiers. We show that only 1.9% of languages (Tier 3, High) account for 74.6% of all tokens in seven major corpora, while 71.7% of languages (Tier 0, Scarce) contribute just 1.0%. Statistical analyses of within-tier inequality, dispersion, and distributional skew confirm that this imbalance is both extreme and systematic. Our results provide a principled framework for dataset curation and tier-aware evaluation of multilingual LLMs.
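
The abstract does not name the specific statistics used for inequality, dispersion, or skew. As one illustration of how inequality in token counts can be quantified, the sketch below computes a Gini coefficient, which is 0 when every language has the same number of tokens and approaches 1 when a single language holds nearly all of them. Choosing Gini here is an assumption, not necessarily the paper's measure, and the function name `gini` and the sample values are hypothetical.

```python
def gini(values):
    """Gini coefficient of a list of non-negative counts.

    Uses the sorted-rank formula
        G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    with x sorted ascending and i running from 1 to n.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0  # no data or all zeros: treat as perfectly equal
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Made-up token counts for illustration only (not the paper's data).
equal_counts = [5, 5, 5, 5]
skewed_counts = [0, 0, 0, 10]

print(gini(equal_counts))
print(gini(skewed_counts))
```

Applied to the token counts of the languages inside one tier, a value like this would indicate how unevenly resources are spread within that tier.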