EnTaCs: Analyzing the Relationship Between Sentiment and Language Choice in English-Tamil Code-Switching

arXiv cs.CL / 3/30/2026

Key Points

The paper studies how utterance sentiment influences language choice in English-Tamil code-switched text, combining machine learning with statistical modeling.
Using a fine-tuned XLM-RoBERTa model for token-level language identification on 35,650 romanized YouTube comments from the DravidianCodeMix dataset, the authors estimate English proportion and language switch frequency per utterance.
Linear regression shows that positive utterances have a significantly higher English proportion (34.3%) than negative utterances (24.8%).
The analysis also finds that mixed-sentiment utterances correlate with the highest language switch frequency, after controlling for utterance length.
The findings support the idea that emotional content affects code-switching behavior through socio-linguistic associations of prestige and identity tied to matrix and embedded languages.
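
To make the two per-utterance measurements concrete, here is a minimal Python sketch of how they could be derived once every token carries a language tag. The tag names ("en", "ta"), the choice to ignore tokens with any other tag, and the use of a raw switch count rather than a normalized rate are all assumptions for illustration; the paper's exact definitions may differ. The function name `utterance_metrics` is hypothetical.

```python
from typing import List, Sequence, Tuple


def utterance_metrics(tags: Sequence[str]) -> Tuple[float, int]:
    """Return (English proportion, number of language switches) for one utterance.

    `tags` holds one language tag per token, e.g. the output of a
    token-level language identifier. Tag names are assumed, not taken
    from the paper.
    """
    # Assumption: only tokens tagged as English or Tamil count; other
    # tags (names, symbols, etc.) are dropped before measuring.
    langs: List[str] = [t for t in tags if t in ("en", "ta")]
    if not langs:
        return 0.0, 0

    # Share of the remaining tokens that are English.
    english_proportion = sum(t == "en" for t in langs) / len(langs)

    # A switch is any adjacent pair of tokens whose languages differ.
    switches = sum(a != b for a, b in zip(langs, langs[1:]))
    return english_proportion, switches
```

Because longer utterances have more opportunities to switch, a raw count like this is only comparable across utterances once length is accounted for, which is why the analysis controls for utterance length.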

Abstract

This paper investigates the relationship between utterance sentiment and language choice in English-Tamil code-switched text, using methods from machine learning and statistical modelling. We apply a fine-tuned XLM-RoBERTa model for token-level language identification on 35,650 romanized YouTube comments from the DravidianCodeMix dataset, producing per-utterance measurements of English proportion and language switch frequency. Linear regression analysis reveals that positive utterances exhibit significantly greater English proportion (34.3%) than negative utterances (24.8%), and mixed-sentiment utterances show the highest language switch frequency when controlling for utterance length. These findings support the hypothesis that emotional content demonstrably influences language choice in multilingual code-switching settings, due to socio-linguistic associations of prestige and identity with embedded and matrix languages.
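
The phrase "controlling for utterance length" can be illustrated with a small ordinary least squares sketch: sentiment enters as dummy variables and length as an additional predictor, so each sentiment coefficient is a length-adjusted difference from a reference category. This is not the authors' code. The choice of "positive" as the reference level, the helper names, and the toy numbers are all assumptions for illustration, and the solver uses only the Python standard library.

```python
from typing import List, Sequence

# Assumption: "positive" is the reference level; the paper may use another.
SENTIMENTS = ("positive", "negative", "mixed")


def design_row(sentiment: str, length: int) -> List[float]:
    """Intercept, one dummy per non-reference sentiment, and utterance length."""
    return [1.0] + [float(sentiment == s) for s in SENTIMENTS[1:]] + [float(length)]


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows: Sequence[Sequence[float]], y: Sequence[float]) -> List[float]:
    """Ordinary least squares via the normal equations (X^T X) beta = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Invented toy data (sentiment, token count, switch count); NOT from the paper.
toy = [
    ("positive", 5, 1.0), ("positive", 9, 2.0),
    ("negative", 4, 1.0), ("negative", 10, 2.0),
    ("mixed", 6, 3.0), ("mixed", 12, 5.0),
]
rows = [design_row(s, n) for s, n, _ in toy]
beta = fit_ols(rows, [y for _, _, y in toy])
# beta = [intercept, negative vs. positive, mixed vs. positive, per-token slope]
```

In this setup the coefficient on the `mixed` dummy is the expected difference in switch count between a mixed-sentiment and a positive utterance of the same length. A statistics package would be the more usual choice in practice, since it also reports standard errors and p-values.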