Using Embedding Models to Improve Probabilistic Race Prediction

arXiv cs.CL / 4/27/2026

Key Points

  • The paper addresses a key limitation of Bayesian Improved Surname Geocoding (BISG): Census surname data omit about 10% of the US population, causing prediction quality to drop for people with uncommon surnames.
  • Standard BISG falls back on an uninformative generic prior when a surname is omitted, which the authors identify as the cause of the degraded race predictions for these individuals.
  • The authors introduce embedding-powered BISG (eBISG), which represents names using pre-trained text embeddings and trains neural networks on 2020 Census surname and first-name data to infer race probabilities for names not covered by the Census.
  • Five approaches are compared: standard surname-only BISG, BIFSG with first-name probabilities, and three embedding models (surname embedding for unlisted names, combined surname and first-name embedding, and a full-name embedding trained on voter file data from Southern states).
  • Each successive eBISG approach improves race prediction, with the full-name embedding providing the largest gains, especially for Hispanic and Asian voters whose surnames are missing from the Census list.
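
To make the generic-prior problem concrete, here is a minimal sketch of the usual BISG update, in which P(race | surname, location) is proportional to P(race | surname) × P(location | race), assuming surname and location are conditionally independent given race. Every number, the single-entry surname table, and the helper names `surname_prior` and `posterior` are hypothetical illustrations, not Census values or the authors' code.

```python
RACES = ("white", "black", "hispanic", "asian", "other")

# All numbers below are made up for illustration; they are NOT Census values.
SURNAME_TABLE = {  # P(race | surname) for surnames on the (toy) list
    "GARCIA": {"white": 0.05, "black": 0.01, "hispanic": 0.92,
               "asian": 0.01, "other": 0.01},
}
# Fallback used for any surname missing from the list (hypothetical shares).
GENERIC_PRIOR = {"white": 0.60, "black": 0.12, "hispanic": 0.18,
                 "asian": 0.06, "other": 0.04}
# P(living in one hypothetical block | race).
GEO_GIVEN_RACE = {"white": 0.0002, "black": 0.0001, "hispanic": 0.0004,
                  "asian": 0.0003, "other": 0.0002}
# P(first name "MARIA" | race), used for the BIFSG-style variant.
FIRST_GIVEN_RACE = {"white": 0.002, "black": 0.001, "hispanic": 0.02,
                    "asian": 0.002, "other": 0.002}


def surname_prior(surname):
    """Return P(race | surname), or the generic prior if the surname is unlisted."""
    return SURNAME_TABLE.get(surname.upper(), GENERIC_PRIOR)


def posterior(prior, *likelihoods):
    """Multiply the prior by each likelihood term and renormalize (Bayes' rule)."""
    unnorm = {}
    for race in RACES:
        value = prior[race]
        for lik in likelihoods:
            value *= lik[race]
        unnorm[race] = value
    total = sum(unnorm.values())
    return {race: value / total for race, value in unnorm.items()}


# Listed surname: the surname term dominates the prediction.
listed = posterior(surname_prior("Garcia"), GEO_GIVEN_RACE)
# Unlisted surname: the name contributes nothing, only geography moves the prior.
unlisted = posterior(surname_prior("Xyzzyx"), GEO_GIVEN_RACE)
# BIFSG-style variant: a first-name likelihood adds signal even without a surname match.
unlisted_bifsg = posterior(surname_prior("Xyzzyx"), GEO_GIVEN_RACE, FIRST_GIVEN_RACE)
```

With an unlisted surname, the same generic prior is applied to every such person, so the prediction depends only on where they live. This is the gap the embedding-based variants aim to close.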

Abstract

Estimating racial disparity requires individual-level race data, which are often unavailable due to the sensitivity of collecting such information. To address this problem, many researchers use Bayesian Improved Surname Geocoding (BISG), which relies critically on Census surname data. Unfortunately, these data capture race-surname relationships only for common surnames, omitting approximately 10% of the US population. We show that predictive performance degrades substantially for individuals with such omitted, uncommon surnames because the standard BISG implementation relies on an uninformative generic prior in these cases. To address this limitation, we propose embedding-powered BISG (eBISG), which uses pre-trained text embeddings to represent names as dense vectors and trains neural networks on 2020 Census surname and first-name data to estimate race probabilities for names not covered in the Census. We compare five approaches: standard BISG using only surnames, BIFSG incorporating first-name probabilities, a surname embedding for unlisted names, a surname and first-name embedding combining both, and a full-name embedding trained on voter file data from Southern states that captures interactions between name components. We show that each successive eBISG approach improves race prediction, with the full-name embedding yielding the largest gains, particularly for Hispanic and Asian voters whose surnames are absent from the Census list.
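
The surname-embedding idea described in the abstract can be sketched as follows: learn a map from name vectors to race shares on listed surnames, then use it as the prior for unlisted ones. This is a toy stand-in under loud assumptions, not the authors' pipeline. Hashed character trigrams replace the pre-trained text embedding, a single softmax layer replaces the neural network, and the six-row table with its shares is invented. The names `embed`, `train`, `predict`, and `surname_prior` are hypothetical.

```python
import math
import zlib

RACES = ("white", "black", "hispanic", "asian", "other")
DIM = 256  # size of the stand-in embedding (hypothetical choice)

# Invented P(race | surname) rows in RACES order; NOT Census values.
TOY_TABLE = {
    "HERNANDEZ": (0.04, 0.01, 0.93, 0.01, 0.01),
    "MARTINEZ": (0.05, 0.01, 0.92, 0.01, 0.01),
    "NGUYEN": (0.01, 0.01, 0.01, 0.96, 0.01),
    "ZHANG": (0.01, 0.01, 0.01, 0.96, 0.01),
    "MILLER": (0.84, 0.11, 0.02, 0.01, 0.02),
    "OLSON": (0.92, 0.03, 0.02, 0.01, 0.02),
}


def embed(name):
    """Stand-in for a pre-trained text embedding: L2-normalized hashed trigrams."""
    padded = f"^{name.upper()}$"
    vec = [0.0] * DIM
    for i in range(len(padded) - 2):
        vec[zlib.crc32(padded[i:i + 3].encode()) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_probs(weights, x):
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in weights])


def train(table, epochs=100, lr=1.0):
    """Fit softmax regression to the soft labels with per-example gradient steps."""
    weights = [[0.0] * DIM for _ in RACES]
    data = [(embed(name), shares) for name, shares in table.items()]
    for _ in range(epochs):
        for x, target in data:
            probs = class_probs(weights, x)
            for k, row in enumerate(weights):
                grad = probs[k] - target[k]  # cross-entropy gradient w.r.t. logit k
                for j, xi in enumerate(x):
                    row[j] -= lr * grad * xi
    return weights


def predict(weights, name):
    return dict(zip(RACES, class_probs(weights, embed(name))))


def surname_prior(name, table, weights):
    """Use the table when the surname is listed, else the learned model."""
    row = table.get(name.upper())
    return dict(zip(RACES, row)) if row is not None else predict(weights, name)


weights = train(TOY_TABLE)
prior = surname_prior("Fernandez", TOY_TABLE, weights)  # unlisted in the toy table
```

The query name shares most of its trigrams with a Hispanic-majority training name, so the learned prior should lean that way, whereas a generic prior would ignore the name entirely. The resulting distribution can be dropped into the Bayes update in place of the generic prior.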