Assessing Capabilities of Large Language Models in Social Media Analytics: A Multi-task Quest

arXiv cs.CL / 4/22/2026


Key Points

  • The study presents the first comprehensive evaluation of modern large language models (GPT-4, GPT-4o, GPT-3.5-Turbo, Gemini 1.5 Pro, DeepSeek-V3, Llama 3.2, and BERT) on three social media analytics tasks using an X (Twitter) dataset.
  • For authorship verification, the authors introduce a systematic sampling framework and test generalization on newly collected tweets from January 2024 onward to reduce “seen-data” bias.
  • For post generation, the work evaluates how well LLMs can generate authentic, user-like content using comprehensive quantitative metrics and a user study focused on perceived authenticity conditioned on users’ own writing.
  • For user attribute inference, the authors annotate occupations and interests with standardized taxonomies (IAB Tech Lab 2023 and U.S. SOC 2018) and benchmark LLM performance against existing baselines.
  • The paper contributes unified, reproducible benchmarks and states that code and data will be made publicly available with publication.
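The authorship-verification benchmark rests on how same-author and different-author post pairs are drawn. As a minimal illustration only (this is not the paper's actual sampling framework; the function and variable names here are hypothetical), balanced pair sampling might look like:

```python
import random

def sample_verification_pairs(posts_by_user, n_pairs, seed=0):
    """Build balanced same-author / different-author post pairs.

    posts_by_user: dict mapping a user id to that user's list of posts.
    Returns a list of ((post_a, post_b), label), where label is 1 for
    a same-author pair and 0 for a different-author pair.
    """
    rng = random.Random(seed)
    # Only users with at least two posts can supply a same-author pair.
    users = [u for u, posts in posts_by_user.items() if len(posts) >= 2]
    pairs = []
    for i in range(n_pairs):
        if i % 2 == 0:
            # Positive pair: two distinct posts from one user.
            u = rng.choice(users)
            post_a, post_b = rng.sample(posts_by_user[u], 2)
            pairs.append(((post_a, post_b), 1))
        else:
            # Negative pair: one post each from two distinct users.
            u1, u2 = rng.sample(users, 2)
            pairs.append(((rng.choice(posts_by_user[u1]),
                           rng.choice(posts_by_user[u2])), 0))
    return pairs
```

Seeding the random generator keeps the sampled pairs identical across runs, which matches the article's emphasis on reproducible benchmarks.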

Abstract

In this study, we present the first comprehensive evaluation of modern LLMs - including GPT-4, GPT-4o, GPT-3.5-Turbo, Gemini 1.5 Pro, DeepSeek-V3, Llama 3.2, and BERT - across three core social media analytics tasks on a Twitter (X) dataset: (I) Social Media Authorship Verification, (II) Social Media Post Generation, and (III) User Attribute Inference. For authorship verification, we introduce a systematic sampling framework over diverse user and post selection strategies and evaluate generalization on newly collected tweets from January 2024 onward to mitigate "seen-data" bias. For post generation, we assess the ability of LLMs to produce authentic, user-like content using comprehensive evaluation metrics. Bridging Tasks I and II, we conduct a user study to measure real users' perceptions of LLM-generated posts conditioned on their own writing. For attribute inference, we annotate occupations and interests using two standardized taxonomies (IAB Tech Lab 2023 and 2018 U.S. SOC) and benchmark LLMs against existing baselines. Overall, our unified evaluation provides new insights and establishes reproducible benchmarks for LLM-driven social media analytics. The code and data are provided in the supplementary material and will also be made publicly available upon publication.