NameBERT: Scaling Name-Based Nationality Classification with LLM-Augmented Open Academic Data

arXiv cs.CL / 4/14/2026


Key Points

  • The paper proposes NameBERT, a method for scaling name-based nationality classification by building a large dataset from the Open Academic Graph (OAG) rather than relying on small, source-specific labeled data.
  • It uses LLMs as “dataset enrichers” to generate additional names for low-resource countries, avoiding the high latency and cost of running LLMs as direct inference engines at deployment time.
  • Experiments show that performance gains from augmentation are especially large when evaluation includes synthetic “tail” names, and augmentation still yields a modest improvement on tail-country metrics even when evaluation uses real data only.
  • The resulting NameBERT models outperform state-of-the-art baselines on both in-domain and out-of-domain tasks while remaining efficient for large-scale inference compared with pure LLM-based approaches.
  • The work targets downstream needs such as equity and bias monitoring, personalization, and research applications in biomedical and sociological studies.
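The "dataset enricher" idea in the key points can be illustrated with a minimal sketch: count how many real names each country contributes, flag countries below a target threshold, and compute how many synthetic names an LLM would need to generate for each. The function name, threshold, and example data below are illustrative assumptions, not details from the paper.

```python
from collections import Counter

def enrichment_plan(name_country_pairs, target_per_country=5):
    """Return {country: num_synthetic_names_needed} for tail countries.

    Hypothetical sketch of the LLM-as-enricher step: countries whose
    real-name count falls below the target get topped up with
    LLM-generated names; well-resourced countries are left untouched.
    """
    counts = Counter(country for _, country in name_country_pairs)
    return {
        country: target_per_country - n
        for country, n in counts.items()
        if n < target_per_country
    }

# Toy data: PL is well-resourced, BT is a low-resource "tail" country.
pairs = [("Anna Kowalska", "PL")] * 7 + [("Tenzin Dorji", "BT")] * 2
print(enrichment_plan(pairs))  # {'BT': 3}
```

In a full pipeline, the returned counts would drive LLM prompts ("generate N plausible personal names from country X"), keeping the LLM offline at dataset-construction time rather than in the serving path.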

Abstract

Inferring nationality from personal names is a critical capability for equity and bias monitoring and personalization, and a valuable tool in biomedical and sociological research. However, existing name-based nationality classifiers are typically trained on relatively small or source-specific labeled datasets, which can introduce coverage gaps and limit performance for underrepresented countries. While large language models (LLMs) demonstrate strong zero-shot performance for name-based nationality prediction, their computational cost and latency make them impractical for real-time, large-scale deployment. In this work, we construct a large-scale name-nationality dataset from the Open Academic Graph (OAG) and introduce a framework that leverages LLMs as dataset enrichers rather than inference engines. We augment low-resource countries with LLM-generated names and evaluate on real and synthetic-tail test sets. We find that augmentation produces large gains when evaluation includes synthetic tail names and still offers a modest lift on tail-country metrics when evaluation uses real data only. Overall, NameBERT models achieve significantly higher accuracy than state-of-the-art baselines across both in- and out-of-domain tasks, while remaining efficient for large-scale inference compared to LLMs.
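The abstract's "tail-country metrics" can be made concrete with a small sketch: a macro-averaged accuracy restricted to a designated set of low-resource countries, so that each tail country counts equally regardless of how many test names it has. The function name, the country codes, and the toy predictions are illustrative assumptions, not the paper's exact metric definition.

```python
def tail_macro_accuracy(preds, golds, tail_countries):
    """Macro accuracy over tail countries: per-country accuracy is
    computed first, then averaged, so rare countries are not swamped
    by well-resourced ones. Hypothetical sketch, not the paper's code."""
    per_country = {}
    for pred, gold in zip(preds, golds):
        if gold in tail_countries:
            hits, total = per_country.get(gold, (0, 0))
            per_country[gold] = (hits + (pred == gold), total + 1)
    if not per_country:
        return 0.0
    return sum(h / t for h, t in per_country.values()) / len(per_country)

golds = ["PL", "BT", "BT", "MN", "PL"]
preds = ["PL", "BT", "PL", "MN", "PL"]
# BT: 1/2 correct, MN: 1/1 correct -> macro average 0.75
print(tail_macro_accuracy(preds, golds, {"BT", "MN"}))  # 0.75
```

Evaluating this metric once on a real-only test set and once on a test set that mixes in synthetic tail names would surface the two regimes the abstract contrasts.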