A Taxonomy of Programming Languages for Code Generation
arXiv cs.CL / 4/3/2026
Key Points
- The paper proposes the first reproducible taxonomy of programming languages for code generation by classifying 646 languages into four resource tiers.
- It finds a highly skewed resource distribution: Tier 3 (High) contains only 1.9% of the languages, yet those languages account for 74.6% of all code tokens across seven major corpora.
- Conversely, Tier 0 (Scarce) contains 71.7% of the languages but contributes just 1.0% of tokens, indicating an extreme and systematic imbalance in available code data.
- The authors validate the imbalance using statistical measures of inequality, dispersion, and distributional skew, and argue that it is a critical consideration for fair dataset curation.
- The taxonomy is intended to enable tier-aware evaluation of multilingual LLMs for code generation, improving how performance is measured across language resource levels.
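
To make the tiering and inequality ideas above concrete, here is a minimal Python sketch. It is not the paper's method: the summary does not give the tier cutoffs or the exact statistics used, so the `thresholds` values, the function names, and the toy token counts are all hypothetical, and the Gini coefficient is used only as one common example of an inequality measure.

```python
from collections import defaultdict


def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted formula on ascending-sorted data.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n


def assign_tier(share, thresholds=(0.0001, 0.001, 0.01)):
    """Map a language's token share to a tier 0..3.

    The cutoffs are hypothetical placeholders, not the paper's values.
    """
    return sum(share >= t for t in thresholds)


def tier_summary(token_counts):
    """Return {tier: (fraction of languages, fraction of tokens)}."""
    total = sum(token_counts.values())
    langs = defaultdict(int)
    tokens = defaultdict(int)
    for count in token_counts.values():
        tier = assign_tier(count / total)
        langs[tier] += 1
        tokens[tier] += count
    n = len(token_counts)
    return {t: (langs[t] / n, tokens[t] / total) for t in sorted(langs)}


# Toy, made-up token counts for five placeholder languages.
toy_counts = {"lang_a": 9_000_000, "lang_b": 900_000, "lang_c": 50_000,
              "lang_d": 500, "lang_e": 100}
summary = tier_summary(toy_counts)
for tier, (lang_frac, token_frac) in summary.items():
    print(f"Tier {tier}: {lang_frac:.1%} of languages, {token_frac:.2%} of tokens")
print(f"Gini over token counts: {gini(toy_counts.values()):.3f}")
```

For tier-aware evaluation, the same tier labels could be used to average a model's benchmark score within each tier before comparing tiers, so that a handful of high-resource languages do not dominate the headline number.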