Language Specific Knowledge: Do Models Know Better in X than in English?

arXiv cs.CL / 4/27/2026


Key Points

  • The paper argues that multilingual LLMs’ standard goal of mapping semantically similar content into a shared latent space has important limitations, and that changing the query language can improve question-answering quality.
  • It introduces the concept of Language Specific Knowledge (LSK), where certain questions are best answered in an “expert language” tailored to a specific LLM.
  • The authors frame “language selection” as a task: for some queries, models perform better in non-English languages, including sometimes low-resource languages.
  • They propose and evaluate several “simple-to-strong” baselines, including their LSKExtractor method, using datasets covering cultural and social behavioral norms.
  • Experiments show counterintuitive language-to-knowledge patterns across models (e.g., Gemma models perform best on China/Middle East knowledge in Spanish, while Qwen models perform best on authority/responsibility questions in Arabic and Chinese), and the authors position these findings as helpful for deploying open models that are more culturally and linguistically aligned.

Abstract

Often, multilingual language models are trained with the objective of mapping semantically similar content (in different languages) into the same latent space. In this paper, we show a nuance in this training objective, and find that by changing the language of the input query, we can improve the question-answering ability of language models. We make two main contributions. First, we introduce the term Language Specific Knowledge (LSK) to denote queries that are best answered in an "expert language" for a given LLM, thereby enhancing its question-answering ability. We introduce the problem of language selection: for some queries, language models perform better when queried in languages other than English, sometimes even in low-resource languages, and the goal is to select the optimal language for each query. Second, we introduce a variety of simple to strong baselines to empirically motivate the language selection problem (including one of our own methods called LSKExtractor). During our evaluation, we employ three datasets that contain knowledge about both cultural and social behavioral norms. Overall, the results show that principled language selection can improve the performance of a language model, and that the expected question-to-language map is not always intuitive: Gemma models know most about China and the Middle East in Spanish; Qwen models know most about authority and responsibility in Arabic and Chinese. Broadly, our research contributes to the open-source development of language models that are inclusive and more aligned with the cultural and linguistic contexts in which they are deployed.
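
To make the language-selection task concrete, below is a minimal sketch of a per-language selection baseline; it is not the paper's LSKExtractor, whose details are not given here. The names `translate`, `ask_model`, `is_correct`, and the candidate-language list are hypothetical placeholders standing in for a translation step, an LLM call, and an answer-matching check. The idea is simply to measure held-out accuracy for each query language and pick the best one.

```python
# Sketch only: a naive per-language selection baseline, assuming the caller
# supplies a translation function, a model-query function, and a scoring
# function. These callables are placeholders, not any specific library's API.

from collections import defaultdict
from typing import Callable, Iterable


def pick_expert_language(
    dev_set: Iterable[tuple[str, str]],        # (question, gold answer) pairs
    candidate_languages: Iterable[str],        # e.g. ["en", "es", "ar", "zh"]
    translate: Callable[[str, str], str],      # translate(text, target_lang) -> text
    ask_model: Callable[[str], str],           # ask_model(prompt) -> model answer
    is_correct: Callable[[str, str], bool],    # is_correct(prediction, gold) -> bool
) -> str:
    """Return the candidate language with the highest dev-set accuracy."""
    dev_set = list(dev_set)
    accuracy = defaultdict(float)
    for lang in candidate_languages:
        hits = 0
        for question, gold in dev_set:
            prediction = ask_model(translate(question, lang))
            hits += is_correct(prediction, gold)
        accuracy[lang] = hits / max(len(dev_set), 1)
    # The "expert language" for this model and query set is the argmax.
    return max(accuracy, key=accuracy.get)
```

In this framing, English is just one candidate among many, which is what makes the paper's finding interesting: for some query slices the argmax lands on a non-English, and occasionally low-resource, language.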