An Investigation of Linguistic Biases in LLM-Based Recommendations

arXiv cs.CL / 4/29/2026


Key Points

  • The study examines how linguistic dialects—Southern American English, Indian English, and code-switched Hindi-English—affect LLM-based restaurant and product recommendations in a cold-start setting.
  • It uses the Yelp Open dataset and a Walmart reviews dataset, prompting multiple LLMs to choose the top-20 items from cuisine- and category-balanced name lists.
  • The researchers vary prompt sampling across 20 seeds, aggregate recommendation counts, and apply mixed-effects regression and likelihood-ratio tests to quantify dialect- and model-size effects.
  • Results indicate that dialect influences the kinds of restaurants recommended, with Mistral-small-3.1 and Llama-3.1 family models showing greater sensitivity to Indian English and code-switched prompts.
  • For product recommendations, Llama-3.1-70B is highly sensitive to code-switched prompts in most categories, and category shifts (e.g., more beauty/home recommendations) differ depending on whether prompts use Indian English or code-switching.
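The sampling-and-aggregation procedure in the key points above can be sketched as follows. This is a minimal illustration, not the authors' code: `query_llm` is a hypothetical stand-in for the actual model call, and the dialect labels and list sizes are assumptions for demonstration.

```python
import random
from collections import Counter

# Illustrative labels for the three dialect variants studied.
DIALECTS = ["southern_ae", "indian_english", "hindi_english_cs"]

def aggregate_recommendations(items_by_category, query_llm, n_seeds=20,
                              list_size=40, top_k=20):
    """Sample category-balanced name lists across seeds, prompt the model
    once per dialect, and count how often each category appears among the
    top-k selections. `query_llm(dialect, names, top_k)` is a hypothetical
    callable returning the list of item names the model picked."""
    counts = {d: Counter() for d in DIALECTS}
    per_cat = list_size // len(items_by_category)  # balance list by category
    for seed in range(n_seeds):
        rng = random.Random(seed)
        names, name_to_cat = [], {}
        for cat, items in items_by_category.items():
            sample = rng.sample(items, per_cat)
            names.extend(sample)
            name_to_cat.update({n: cat for n in sample})
        rng.shuffle(names)  # avoid position bias from category grouping
        for dialect in DIALECTS:
            picks = query_llm(dialect, names, top_k)
            counts[dialect].update(name_to_cat[p] for p in picks
                                   if p in name_to_cat)
    return counts
```

The per-dialect `Counter` objects are what a downstream regression would consume, one count per (seed, dialect, category) cell.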

Abstract

We investigate linguistic biases in LLM-based restaurant and product recommendations given prompts varying across Southern American English (AE), Indian English (IE), and Code-Switched Hindi-English dialects, using the Yelp Open dataset (Yelp Inc., 2023) and the Walmart product reviews dataset (PromptCloud, 2020). We add lists of restaurant and product names, balanced by cuisine type and product category, to the prompts given to the LLM, and we zero-shot prompt the LLMs in a cold-start setting to select the top-20 restaurant and product recommendations from these lists for each dialect-varied prompt. We prompt the LLMs with different list samples across 20 seeds for better generalization, and aggregate per-cuisine-type and per-category response counts for each seed, question/prompt, and LLM. We fit mixed-effects regression models for each model family and topic (restaurant/product) with the aggregated response counts as the dependent variable, and conduct likelihood-ratio tests on the fixed effects with post-hoc pairwise testing of estimated-marginal-means differences, to investigate group-level differences in recommendation counts by model size and dialect type. Results show that dialect plays a role in the type of restaurant selected across the models tested, with the mistral-small-3.1 model and both llama-3.1 family models tested showing greater sensitivity to Indian English and Code-Switched prompts. For product recommendations, the llama-3.1-70B model is particularly sensitive to Code-Switched prompts in four of seven categories, and more beauty and home category recommendations appear when using the Indian English and Code-Switched prompts for larger and smaller models, respectively. No broad trends emerge in model-size-based differences; recommendations differ by model size in ways conditioned on dialect type.
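The likelihood-ratio test for the dialect fixed effect described in the abstract can be sketched with statsmodels: fit a mixed-effects model with and without the dialect term (by maximum likelihood, not REML, since the fixed effects differ), and compare log-likelihoods against a chi-squared reference. The column names (`count`, `dialect`, `seed`) and the random-intercept-per-seed structure are assumptions for illustration, not the paper's exact specification.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

def dialect_lrt(df):
    """Likelihood-ratio test for a dialect fixed effect on aggregated
    recommendation counts, with a random intercept per seed.
    Returns (LR statistic, df, p-value)."""
    # ML fits (reml=False) are required to compare fixed-effect structures.
    full = smf.mixedlm("count ~ dialect", df, groups=df["seed"]).fit(reml=False)
    reduced = smf.mixedlm("count ~ 1", df, groups=df["seed"]).fit(reml=False)
    lr = 2 * (full.llf - reduced.llf)
    k = len(full.fe_params) - len(reduced.fe_params)  # extra fixed effects
    p = stats.chi2.sf(lr, k)
    return lr, k, p
```

A significant result would then motivate the post-hoc pairwise comparisons of estimated marginal means that the paper reports (e.g. via R's `emmeans` or equivalent contrasts).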