Using Embedding Models to Improve Probabilistic Race Prediction
arXiv cs.CL / 4/27/2026
Key Points
- The paper addresses a key limitation of Bayesian Improved Surname Geocoding (BISG): Census surname data omit about 10% of the US population, causing prediction quality to drop for people with uncommon surnames.
- It explains that standard BISG depends on an uninformative generic prior for omitted surname cases, which drives the observed degradation in racial disparity estimation.
- The authors introduce embedding-powered BISG (eBISG), which represents names using pre-trained text embeddings and trains neural networks on 2020 Census surname and first-name data to infer race probabilities for names that the Census lists do not cover.
- Five variants are evaluated, from surname-only BISG to progressively richer embedding models (surname embedding, surname+first-name embedding, and full-name embedding).
- Results show monotonic improvements across the eBISG variants, with the full-name embedding providing the largest gains, especially for Hispanic and Asian voters whose surnames are missing from Census lists.
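
To make the mechanics in the points above concrete, here is a minimal Python sketch of the standard BISG update, P(race | surname, geography) ∝ P(race | surname) · P(geography | race), including the fallback to a generic prior when a surname is absent from the table. It is an illustration under stated assumptions, not the authors' implementation: the race categories, every probability in the toy tables, and the names `bisg_posterior` and `name_model` are hypothetical. The optional `name_model` callable only marks the place where an eBISG-style network, trained on name embeddings, could supply a prior for an uncovered name.

```python
from typing import Callable, Dict, Optional

# Hypothetical race categories and toy probabilities, for illustration only.
RACES = ("white", "black", "hispanic", "asian", "other")

# P(race | surname) for surnames that appear in the (toy) surname table.
SURNAME_TABLE: Dict[str, Dict[str, float]] = {
    "GARCIA": {"white": 0.05, "black": 0.01, "hispanic": 0.92,
               "asian": 0.01, "other": 0.01},
}

# Generic prior used by standard BISG when the surname is not in the table.
GENERIC_PRIOR: Dict[str, float] = {
    "white": 0.60, "black": 0.12, "hispanic": 0.18,
    "asian": 0.06, "other": 0.04,
}

# P(geography | race): share of each race group living in a given area.
GEO_GIVEN_RACE: Dict[str, Dict[str, float]] = {
    "tract_A": {"white": 0.001, "black": 0.002, "hispanic": 0.004,
                "asian": 0.001, "other": 0.001},
}


def normalize(weights: Dict[str, float]) -> Dict[str, float]:
    """Rescale non-negative weights so they sum to one."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return {race: w / total for race, w in weights.items()}


def bisg_posterior(
    surname: str,
    geo: str,
    name_model: Optional[Callable[[str], Dict[str, float]]] = None,
) -> Dict[str, float]:
    """Return P(race | surname, geo) via the BISG Bayes update.

    Prior selection:
      1. the surname table, if the surname is covered;
      2. otherwise `name_model` (hypothetical stand-in for a learned
         name-to-race model), if one is supplied;
      3. otherwise the generic prior, as in standard BISG.
    """
    key = surname.strip().upper()
    if key in SURNAME_TABLE:
        prior = SURNAME_TABLE[key]
    elif name_model is not None:
        prior = name_model(key)
    else:
        prior = GENERIC_PRIOR
    likelihood = GEO_GIVEN_RACE[geo]
    return normalize({r: prior[r] * likelihood[r] for r in RACES})
```

Calling `bisg_posterior("Garcia", "tract_A")` uses the table prior, while an uncovered surname with no `name_model` falls back to `GENERIC_PRIOR`, so its posterior is driven almost entirely by geography. That is the degradation the summary describes; passing a learned `name_model` replaces the generic prior with a name-specific one.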