Large language models can disambiguate opioid slang on social media

arXiv cs.CL / 3/12/2026



Key Points

The study evaluates four state-of-the-art LLMs (GPT-4, GPT-5, Gemini 2.5 Pro, Claude Sonnet 4.5) on three slang-disambiguation tasks for opioid-related social media posts.
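It defines three tasks: a lexicon-based setting, in which the LLM disambiguates a specific term within a post; a lexicon-free setting, in which the LLM detects opioid-related posts from context alone; and an emergent slang setting, in which posts contain simulated new slang terms.
LLMs outperform the lexicon baselines on every task. In the lexicon-based setting, LLM F1 scores are 0.824-0.972 on the "fenty" subtask and 0.540-0.862 on the "smack" subtask, versus 0.126 and 0.009 for the best lexicon strategy. In the lexicon-free setting, LLM F1 scores are 0.544-0.769, versus 0.080-0.540 for lexicons. On emergent slang, the LLMs average 0.784 accuracy, 0.712 F1, 0.981 precision, and 0.587 recall, each higher than the two lexicons assessed.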
The authors conclude LLMs can identify relevant content for low-prevalence topics, enhancing data quality for downstream analyses and predictive models in opioid-crisis monitoring.
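
To make the reported metrics concrete, the following is a minimal sketch of how a lexicon baseline can be scored with precision, recall, and F1. The toy posts, their labels, and the function names `lexicon_match` and `precision_recall_f1` are invented for illustration and do not come from the paper's data or code.

```python
import re
from typing import Iterable


def lexicon_match(post: str, lexicon: Iterable[str]) -> bool:
    """Flag a post if it contains any lexicon term as a whole word (case-insensitive)."""
    return any(
        re.search(rf"\b{re.escape(term)}\b", post, re.IGNORECASE) for term in lexicon
    )


def precision_recall_f1(
    y_true: list[bool], y_pred: list[bool]
) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 for the positive (opioid-related) class."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical toy posts with hand-assigned labels (True = opioid-related).
toy_posts = [
    ("I will smack you if you touch my fries", False),
    ("lip smack sounds drive me crazy", False),
    ("he got clean after years on smack", True),
    ("parked smack in the middle of the road", False),
    ("lost another friend to an overdose last night", True),
]

labels = [label for _, label in toy_posts]
predictions = [lexicon_match(text, ["smack"]) for text, _ in toy_posts]
precision, recall, f1 = precision_recall_f1(labels, predictions)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

On this toy set the single-term lexicon flags three non-opioid uses of "smack" and misses a relevant post that contains no lexicon term, which is the kind of failure the ambiguity problem produces. Note also that an average F1 across models need not equal the F1 computed from average precision and recall: 0.981 and 0.587 would give roughly 0.734, whereas the reported average F1 is 0.712.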

Abstract

Social media text shows promise for monitoring trends in the opioid overdose crisis; however, the overwhelming majority of social media text is unrelated to opioids. A common strategy for identifying relevant content is to use a lexicon of opioid-related terms as inclusion criteria. However, many slang terms for opioids, such as "smack" or "blues," have common non-opioid meanings, making them ambiguous. The advanced textual reasoning capability of large language models (LLMs) presents an opportunity to disambiguate these slang terms at scale. We present three tasks on which to evaluate four state-of-the-art LLMs (GPT-4, GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4.5): a lexicon-based setting, in which the LLM must disambiguate a specific term within the context of a given post; a lexicon-free setting, in which the LLM must identify opioid-related posts from context without a lexicon; and an emergent slang setting, in which the LLM must identify opioid-related posts with simulated new slang terms. All four LLMs showed excellent performance across all tasks. In both subtasks of the lexicon-based setting, LLM F1 scores ("fenty" subtask: 0.824-0.972; "smack" subtask: 0.540-0.862) far exceeded those of the best lexicon strategy (0.126 and 0.009, respectively). In the lexicon-free task, LLM F1 scores (0.544-0.769) surpassed those of lexicons (0.080-0.540), and LLMs demonstrated uniformly higher recall. On emergent slang, all LLMs had higher accuracy (average: 0.784), F1 score (average: 0.712), precision (average: 0.981), and recall (average: 0.587) than the two lexicons assessed. Our results show that LLMs can be used to identify relevant content for low-prevalence topics, including but not limited to opioid references, enhancing data provided to downstream analyses and predictive models.
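
The three settings described in the abstract could be set up roughly as follows. This is a sketch under stated assumptions, not the authors' method: the prompt wording, the whole-word substitution used to simulate new slang, the made-up term "zorp", and every function name here are hypothetical, and `stub_model` is a keyword placeholder standing in for a real LLM call.

```python
import re
from typing import Callable


def lexicon_based_prompt(post: str, term: str) -> str:
    """Lexicon-based setting: ask about one specific term in the post (hypothetical wording)."""
    return (
        f'Post: "{post}"\n'
        f'Does the term "{term}" in this post refer to an opioid? Answer yes or no.'
    )


def lexicon_free_prompt(post: str) -> str:
    """Lexicon-free setting: ask about the post as a whole (hypothetical wording)."""
    return f'Post: "{post}"\nIs this post about opioids? Answer yes or no.'


def simulate_emergent_slang(post: str, known_term: str, new_term: str) -> str:
    """Assumed simulation: swap a known slang term for an unseen one (whole word, any case)."""
    pattern = rf"\b{re.escape(known_term)}\b"
    return re.sub(pattern, lambda _match: new_term, post, flags=re.IGNORECASE)


def parse_yes_no(answer: str) -> bool:
    """Treat any answer beginning with 'yes' as a positive label."""
    return answer.strip().lower().startswith("yes")


def classify(prompt: str, model: Callable[[str], str]) -> bool:
    """Send a prompt to any callable that maps prompt text to answer text."""
    return parse_yes_no(model(prompt))


def stub_model(prompt: str) -> str:
    """Placeholder only: a keyword rule, NOT an LLM. Replace with a real model call."""
    return "Yes." if "overdose" in prompt.lower() else "No."


post = "two people overdosed on fenty at the show"
rewritten = simulate_emergent_slang(post, "fenty", "zorp")
print(rewritten)
print(classify(lexicon_free_prompt(rewritten), stub_model))
print(classify(lexicon_based_prompt("new Fenty lipstick shade dropped", "Fenty"), stub_model))
```

Passing the model in as a callable keeps the task definitions separate from any particular provider, so the same prompts can be reused across the models being compared.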