Machine learning and digital pragmatics: Which word category influences emoji use most?

arXiv cs.LG / 4/24/2026

Key Points

  • The study uses a fine-tuned MARBERT machine learning model to predict emoji usage from Arabic tweets, focusing on multiple Arabic colloquial dialects.
  • Of 11,379 colloquial Arabic tweets collected from X.com, 8,695 were retained for analysis and classified into 14 emoji-related categories, which were numerically encoded and used as labels.
  • A preprocessing pipeline served as an interpretable baseline for examining how lexical (word-level) features relate to the emoji categories.
  • The fine-tuned model reached an overall accuracy of 0.75; performance was also assessed with precision, recall, and F1-score.
  • The authors conclude that results are promising but emphasize the need to improve ML approaches for low-resource, multidialectal languages such as Arabic.
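
The numeric label encoding and the lexical view described in the points above can be illustrated with a minimal sketch. The source does not list the 14 categories or describe the preprocessing steps, so the category names (`category_0` to `category_13`), the helper functions, and the toy texts below are all hypothetical placeholders, and whitespace tokenization is only an assumption standing in for whatever the authors actually used.

```python
from collections import Counter, defaultdict

# Hypothetical category names: the source does not list the 14 categories.
CATEGORIES = [f"category_{i}" for i in range(14)]

# Numeric encoding: each category name maps to an integer label 0..13.
LABEL_TO_ID = {name: idx for idx, name in enumerate(CATEGORIES)}
ID_TO_LABEL = {idx: name for name, idx in LABEL_TO_ID.items()}


def encode_labels(names):
    """Convert category names into integer labels."""
    return [LABEL_TO_ID[name] for name in names]


def token_counts_by_label(texts, label_ids):
    """Count whitespace-separated tokens per label (a crude lexical view)."""
    counts = defaultdict(Counter)
    for text, label in zip(texts, label_ids):
        counts[label].update(text.split())
    return counts


# Toy placeholder data, not drawn from the study's corpus.
texts = ["word_a word_b", "word_a word_c", "word_d"]
names = ["category_0", "category_0", "category_3"]

label_ids = encode_labels(names)
counts = token_counts_by_label(texts, label_ids)

print(label_ids)                  # integer labels for the toy tweets
print(counts[0].most_common(1))   # most frequent token under label 0
```

Inspecting per-label token counts like this is one simple way to see which words co-occur with which category before any model is trained; it is not a reconstruction of the paper's pipeline.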

Abstract

This study investigates the use of machine learning (ML) to predict emojis in Arabic tweets, employing the state-of-the-art MARBERT model. A corpus of 11,379 colloquial Arabic (CA) tweets representing multiple dialects was collected from X.com using Python. The net dataset comprised 8,695 tweets, which were used for the analysis. These tweets were classified into 14 categories, which were numerically encoded and used as labels. A preprocessing pipeline was designed as an interpretable baseline, allowing us to examine the relationship between lexical features and emoji categories. MARBERT was fine-tuned to predict emoji use from textual input. We evaluated model performance in terms of precision, recall, and F1-score. The model performed reasonably well, with an overall accuracy of 0.75. The study concludes that although the findings are promising, machine learning models, including MARBERT, still need improvement, particularly for low-resource and multidialectal languages such as Arabic.
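
To make the evaluation metrics concrete, the following is a small sketch of how per-class precision, recall, and F1-score, along with overall accuracy, can be computed for a multi-class classifier. The function names and the toy labels are illustrative assumptions; the toy numbers are not the study's data and do not reproduce its reported 0.75 accuracy.

```python
def per_class_metrics(y_true, y_pred, num_classes):
    """Return {class_id: (precision, recall, f1)} for integer labels."""
    metrics = {}
    for c in range(num_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # Guard against division by zero when a class is never predicted
        # or never appears in the reference labels.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        metrics[c] = (precision, recall, f1)
    return metrics


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels for three classes (hypothetical, for illustration only).
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2]

metrics = per_class_metrics(y_true, y_pred, num_classes=3)
acc = accuracy(y_true, y_pred)

for class_id, (p, r, f1) in metrics.items():
    print(f"class {class_id}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy={acc:.2f}")
```

Reporting per-class scores alongside accuracy matters when the label distribution is uneven, since a single accuracy figure can hide weak performance on rare categories.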