\textit{Versteasch du mi?} Computational and Socio-Linguistic Perspectives on GenAI, LLMs, and Non-Standard Language

arXiv cs.CL / 3/31/2026

💬 OpinionIdeas & Deep AnalysisModels & Research

共有:

Key Points

The paper argues that current GenAI and LLM design can be unfair to less-spoken languages and may widen the digital language divide rather than reduce it.
It connects these outcomes to sociolinguistic critique, claiming that models are shaped by earlier processes of language standardisation tied to European nationalist and colonial histories.
Using South Tyrolean dialects and Kurdish varieties as case studies, the authors examine how linguistic non-standardness interacts with model behavior and standardisation pressures.
The work discusses both technical approaches to make LLMs handle non-standard language and the policy question of whether such efforts can support “democratic and decolonial” digital and machine learning strategies.

Abstract

The design of Large Language Models and generative artificial intelligence has been shown to be "unfair" to less-spoken languages and to deepen the digital language divide. Critical sociolinguistic work has also argued that these technologies are not only made possible by prior socio-historical processes of linguistic standardisation, often grounded in European nationalist and colonial projects, but also exacerbate epistemologies of language as "monolithic, monolingual, syntactically standardized systems of meaning". In our paper, we draw on earlier work on the intersections of technology and language policy and bring our respective expertise in critical sociolinguistics and computational linguistics to bear on an interrogation of these arguments. We take two different complexes of non-standard linguistic varieties in our respective repertoires--South Tyrolean dialects, which are widely used in informal communication in South Tyrol, Italy, as well as varieties of Kurdish--as starting points to an interdisciplinary exploration of the intersections between GenAI and linguistic variation and standardisation. We discuss both how LLMs can be made to deal with nonstandard language from a technical perspective, and whether, when or how this can contribute to "democratic and decolonial digital and machine learning strategies", which has direct policy implications.