Development of a European Union Time-Indexed Reference Dataset for Assessing the Performance of Signal Detection Methods in Pharmacovigilance using a Large Language Model

arXiv cs.CL / 3/30/2026

Key Points

  • The study proposes a time-indexed reference dataset for the EU that records when adverse events (AEs) are officially recognized in Summaries of Product Characteristics (SmPCs), so that analyses can be restricted to pre-confirmation periods and early signal detection performance can be evaluated.
  • It retrieves current and historical SmPCs for all 1,513 EU centrally authorized products from the EU Union Register of Medicinal Products (data lock: 15 December 2025), extracts Section 4.8, and identifies drug-AE associations with DeepSeek V3.
  • The resulting database contains 17,763 SmPC versions from 1995–2025 and 125,026 drug-AE associations; the time-indexed reference dataset, restricted to active products, covers 1,479 medicinal products and 110,823 drug-AE associations.
  • Most AEs were identified pre-marketing (74.5%) rather than post-marketing (25.5%), safety updates peaked around 2012, and gastrointestinal, skin, and nervous system disorders were the most represented System Organ Classes (SOCs); drugs had a median of 48 AEs across 14 SOCs.
  • By attaching regulatory metadata and labeling-change timing, the dataset is positioned to improve and standardize benchmarking of pharmacovigilance signal detection methods and comparisons across approaches.
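One way a time-indexed reference set could be used for benchmarking is sketched below: a detection counts as "early" only if the signal was raised before the AE entered the label. This is a minimal illustration, not the paper's evaluation protocol. The function name `early_detection_rate`, the dictionary layout, and the drug names, AE terms, and dates are all invented for the example.

```python
from datetime import date


def early_detection_rate(reference, detections):
    """Share of reference drug-AE pairs flagged before label inclusion.

    reference:  {(drug, ae): date the AE was added to the label}
    detections: {(drug, ae): date a method first raised the signal}
    Both layouts are hypothetical, chosen only for this sketch.
    """
    if not reference:
        return 0.0
    early = sum(
        1
        for pair, inclusion_date in reference.items()
        # A detection on or after the inclusion date is not "early".
        if pair in detections and detections[pair] < inclusion_date
    )
    return early / len(reference)


# Invented example data (not taken from the dataset).
reference = {
    ("drug_a", "rash"): date(2014, 6, 1),
    ("drug_a", "headache"): date(2016, 2, 1),
    ("drug_b", "nausea"): date(2012, 9, 1),
}
detections = {
    ("drug_a", "rash"): date(2013, 1, 1),      # before inclusion: early
    ("drug_a", "headache"): date(2017, 1, 1),  # after inclusion: not early
}

rate = early_detection_rate(reference, detections)
```

Here only one of the three reference pairs is detected before its inclusion date, so `rate` is one third. A real benchmark would also need negative controls and a choice of which reference pairs to include, neither of which this sketch addresses.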

Abstract

Background: The identification of optimal signal detection methods is hindered by the lack of reliable reference datasets. Existing datasets do not capture when adverse events (AEs) are officially recognized by regulatory authorities, preventing restriction of analyses to pre-confirmation periods and limiting evaluation of early detection performance. This study addresses this gap by developing a time-indexed reference dataset for the European Union (EU), incorporating the timing of AE inclusion in product labels along with regulatory metadata.

Methods: Current and historical Summaries of Product Characteristics (SmPCs) for all centrally authorized products (n=1,513) were retrieved from the EU Union Register of Medicinal Products (data lock: 15 December 2025). Section 4.8 was extracted and processed using DeepSeek V3 to identify AEs. Regulatory metadata, including labelling changes, were programmatically extracted. Time indexing was based on the date of AE inclusion in the SmPC.

Results: The database includes 17,763 SmPC versions spanning 1995–2025, comprising 125,026 drug-AE associations. The time-indexed reference dataset, restricted to active products, included 1,479 medicinal products and 110,823 drug-AE associations. Most AEs were identified pre-marketing (74.5%) versus post-marketing (25.5%). Safety updates peaked around 2012. Gastrointestinal, skin, and nervous system disorders were the most represented System Organ Classes. Drugs had a median of 48 AEs across 14 SOCs.

Conclusions: The proposed dataset addresses a critical gap in pharmacovigilance by incorporating temporal information on AE recognition for the EU, supporting more accurate assessment of signal detection performance and facilitating methodological comparisons across analytical approaches.
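The time-indexing step described in the Methods can be sketched as follows: each AE is assigned the date of the earliest SmPC version that lists it. This is a minimal illustration under stated assumptions, not the authors' pipeline. The function names, the `(version_date, ae_terms)` layout, and the sample terms and dates are hypothetical. The pre-/post-marketing rule (an AE is "pre-marketing" if it is first listed on or before the authorisation date) is also an assumption, since the abstract does not define the split.

```python
from datetime import date


def first_inclusion_dates(versions):
    """Map each AE term to the date of the earliest SmPC version listing it.

    versions: iterable of (version_date, ae_terms) pairs for one product,
    where ae_terms is the set of AEs extracted from Section 4.8.
    """
    first_seen = {}
    # Walk versions in chronological order; keep the first date per term.
    for version_date, ae_terms in sorted(versions, key=lambda v: v[0]):
        for term in ae_terms:
            first_seen.setdefault(term, version_date)
    return first_seen


def classify_timing(first_seen, authorisation_date):
    """Label each AE by when it entered the label (assumed rule, see lead-in)."""
    return {
        term: "pre-marketing" if seen <= authorisation_date else "post-marketing"
        for term, seen in first_seen.items()
    }


# Invented SmPC history for one product, deliberately out of order.
versions = [
    (date(2012, 5, 1), {"nausea", "rash"}),
    (date(2010, 1, 15), {"nausea"}),
    (date(2015, 3, 1), {"nausea", "rash", "headache"}),
]

first_seen = first_inclusion_dates(versions)
timing = classify_timing(first_seen, authorisation_date=date(2010, 1, 15))
```

In this example "nausea" is dated to the initial 2010 version and labelled pre-marketing, while "rash" and "headache" are dated to the later versions that introduced them and labelled post-marketing.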