EnTaCs: Analyzing the Relationship Between Sentiment and Language Choice in English-Tamil Code-Switching

arXiv cs.CL / 3/30/2026


Key Points

  • The paper studies how utterance sentiment influences language choice in English-Tamil code-switched text, combining machine learning with statistical modeling.
  • Using a fine-tuned XLM-RoBERTa model for token-level language identification on 35,650 romanized YouTube comments from the DravidianCodeMix dataset, the authors estimate English proportion and language switch frequency per utterance.
  • Linear regression shows that positive utterances have a significantly higher English proportion (34.3%) than negative utterances (24.8%).
  • The analysis also finds that mixed-sentiment utterances correlate with the highest language switch frequency, after controlling for utterance length.
  • The findings support the idea that emotional content affects code-switching behavior through socio-linguistic associations of prestige and identity tied to matrix and embedded languages.
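
As a rough illustration of the two per-utterance measurements named above, the sketch below computes English proportion and language switch count from a list of token-level language tags. The tag names ("en", "ta", "other"), the function name, and the choice to skip tokens that belong to neither language are assumptions made for this example, not details confirmed by the paper.

```python
from typing import List, Tuple


def code_switch_metrics(tags: List[str]) -> Tuple[float, int]:
    """Return (English proportion, number of language switches) for one utterance.

    `tags` holds one language label per token, as a token-level language
    identifier might emit. The label set here is hypothetical.
    """
    # Keep only tokens labelled as one of the two languages (assumption:
    # names, emoji, punctuation and similar tokens are ignored).
    lang_tags = [t for t in tags if t in ("en", "ta")]
    if not lang_tags:
        return 0.0, 0

    # Share of language-bearing tokens that are English.
    english_proportion = sum(t == "en" for t in lang_tags) / len(lang_tags)

    # A switch is any adjacent pair of tokens whose languages differ.
    switches = sum(a != b for a, b in zip(lang_tags, lang_tags[1:]))
    return english_proportion, switches


proportion, switches = code_switch_metrics(["ta", "ta", "en", "en", "ta"])
print(proportion, switches)
```

Counting raw switches makes longer comments look more switch-heavy simply because they have more token boundaries, which is why a length control matters in the later regression step.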

Abstract

This paper investigates the relationship between utterance sentiment and language choice in English-Tamil code-switched text, using methods from machine learning and statistical modelling. We apply a fine-tuned XLM-RoBERTa model for token-level language identification on 35,650 romanized YouTube comments from the DravidianCodeMix dataset, producing per-utterance measurements of English proportion and language switch frequency. Linear regression analysis reveals that positive utterances exhibit significantly greater English proportion (34.3%) than negative utterances (24.8%), and mixed-sentiment utterances show the highest language switch frequency when controlling for utterance length. These findings support the hypothesis that emotional content demonstrably influences language choice in multilingual code-switching settings, due to socio-linguistic associations of prestige and identity with embedded and matrix languages.
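
To make "controlling for utterance length" concrete, here is a minimal sketch of an ordinary least squares fit with dummy-coded sentiment and a length covariate. Everything in it is illustrative: the toy data is synthetic (generated from coefficients chosen for the example, not taken from the paper), "negative" as the baseline category is an assumption, and the hand-written solver stands in for whatever statistics package the authors used.

```python
from typing import List


def design_row(sentiment: str, length: int) -> List[float]:
    # Columns: intercept, positive dummy, mixed dummy, utterance length.
    # "negative" is the baseline category (an assumption for this sketch).
    return [
        1.0,
        1.0 if sentiment == "positive" else 0.0,
        1.0 if sentiment == "mixed" else 0.0,
        float(length),
    ]


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows: List[List[float]], y: List[float]) -> List[float]:
    """Least squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Synthetic toy utterances: (sentiment label, token count). Not real data.
toy = [("negative", 5), ("negative", 12), ("positive", 4),
       ("positive", 10), ("mixed", 6), ("mixed", 15)]
rows = [design_row(s, n) for s, n in toy]

# Switch counts generated from made-up coefficients, so the fit should
# recover them: 0.5 + 0.2 * positive + 1.0 * mixed + 0.1 * length.
true_coef = [0.5, 0.2, 1.0, 0.1]
y = [sum(c * x for c, x in zip(true_coef, row)) for row in rows]

coef = ols(rows, y)
print([round(c, 3) for c in coef])
```

In this setup the coefficient on the mixed dummy is the difference in switch frequency between mixed and negative utterances of the same length, which is the kind of comparison the abstract describes.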