NameBERT: Scaling Name-Based Nationality Classification with LLM-Augmented Open Academic Data

arXiv cs.CL / 4/14/2026


Key Points

  • The paper proposes NameBERT, a method for scaling name-based nationality classification by building a large dataset from the Open Academic Graph (OAG) rather than relying on small, source-specific labeled data.
  • It uses LLMs as “dataset enrichers” to generate additional names for low-resource countries, avoiding the high latency and cost of running LLMs as direct inference engines at deployment time.
  • Experiments show that performance gains from augmentation are especially large when evaluation includes synthetic “tail” names, and augmentation still yields a modest improvement on tail-country metrics even when evaluation uses real data only.
  • The resulting NameBERT models outperform state-of-the-art baselines on both in-domain and out-of-domain tasks while remaining efficient for large-scale inference compared with pure LLM-based approaches.
  • The work targets downstream needs such as equity and bias monitoring, personalization, and research applications in biomedical and sociological studies.
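The "dataset enricher" idea in the key points can be illustrated with a minimal sketch: count how many real names each country contributes, flag countries below a target threshold, and compute how many synthetic names an LLM would need to generate for each. The function name, threshold, and example data below are illustrative assumptions, not details from the paper.

```python
from collections import Counter

def enrichment_plan(name_country_pairs, target_per_country=5):
    """Return {country: num_synthetic_names_needed} for tail countries.

    Hypothetical sketch of the LLM-as-enricher step: countries whose
    real-name count falls below the target get topped up with
    LLM-generated names; well-resourced countries are left untouched.
    """
    counts = Counter(country for _, country in name_country_pairs)
    return {
        country: target_per_country - n
        for country, n in counts.items()
        if n < target_per_country
    }

# Toy data: PL is well-resourced, BT is a low-resource "tail" country.
pairs = [("Anna Kowalska", "PL")] * 7 + [("Tenzin Dorji", "BT")] * 2
print(enrichment_plan(pairs))  # {'BT': 3}
```

In a full pipeline, the returned counts would drive LLM prompts ("generate N plausible personal names from country X"), keeping the LLM offline at dataset-construction time rather than in the serving path.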

Abstract

Inferring nationality from personal names is a critical capability for equity and bias monitoring and personalization, and a valuable tool in biomedical and sociological research. However, existing name-based nationality classifiers are typically trained on relatively small or source-specific labeled datasets, which can introduce coverage gaps and limit performance for underrepresented countries. While large language models (LLMs) demonstrate strong zero-shot performance for name-based nationality prediction, their computational cost and latency make them impractical for real-time, large-scale deployment. In this work, we construct a large-scale name-nationality dataset from the Open Academic Graph (OAG) and introduce a framework that leverages LLMs as dataset enrichers rather than inference engines. We augment low-resource countries with LLM-generated names and evaluate on real and synthetic-tail test sets. We find that augmentation produces large gains when evaluation includes synthetic tail names and still offers a modest lift on tail-country metrics when evaluation uses real data only. Overall, NameBERT models achieve significantly higher accuracy than state-of-the-art baselines across both in- and out-of-domain tasks, while remaining efficient for large-scale inference compared to LLMs.
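The abstract's "tail-country metrics" can be made concrete with a small sketch: a macro-averaged accuracy restricted to a designated set of low-resource countries, so that each tail country counts equally regardless of how many test names it has. The function name, the country codes, and the toy predictions are illustrative assumptions, not the paper's exact metric definition.

```python
def tail_macro_accuracy(preds, golds, tail_countries):
    """Macro accuracy over tail countries: per-country accuracy is
    computed first, then averaged, so rare countries are not swamped
    by well-resourced ones. Hypothetical sketch, not the paper's code."""
    per_country = {}
    for pred, gold in zip(preds, golds):
        if gold in tail_countries:
            hits, total = per_country.get(gold, (0, 0))
            per_country[gold] = (hits + (pred == gold), total + 1)
    if not per_country:
        return 0.0
    return sum(h / t for h, t in per_country.values()) / len(per_country)

golds = ["PL", "BT", "BT", "MN", "PL"]
preds = ["PL", "BT", "PL", "MN", "PL"]
# BT: 1/2 correct, MN: 1/1 correct -> macro average 0.75
print(tail_macro_accuracy(preds, golds, {"BT", "MN"}))  # 0.75
```

Evaluating this metric once on a real-only test set and once on a test set that mixes in synthetic tail names would surface the two regimes the abstract contrasts.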