Measuring research data reuse in scholarly publications using generative artificial intelligence: Open Science Indicator development and preliminary results

arXiv cs.CL / 5/1/2026

Key Points

  • PLOS and DataSeer developed an LLM-based Open Science Indicator focused on measuring the downstream impact of open science, specifically the reuse of research data in scholarly publications.
  • Preliminary results indicate a 43% data reuse rate, which is higher than what traditional bibliometric approaches typically report.
  • The study finds that generative AI and LLMs can measure research data reuse at scale across publications.
  • The authors argue that the benefits of research data sharing and reuse may be currently underestimated due to limitations of existing measurement methods.

Abstract

Numerous metascience studies and other initiatives have begun to monitor the prevalence of open science practices, although it is more important to understand the 'downstream' effects, or impacts, of open science. PLOS and DataSeer have developed a new LLM-based indicator to measure an important effect of open science: the reuse of research data. Our results show a data reuse rate of 43%, which is higher than the rates reported using established bibliometric techniques. We show that data reuse can be measured at scale using LLMs and generative artificial intelligence. The positive effects of research data sharing and reuse may currently be underestimated.
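
To make the headline figure concrete, the sketch below shows one way a reuse rate could be aggregated once each article has been given a yes/no label for data reuse. It is an illustration only, not the authors' pipeline: the labels are invented toy values, the function names `reuse_rate` and `wilson_interval` are hypothetical, and the LLM classification step that would produce the labels is not shown. The Wilson score interval is added here as an assumption about how uncertainty might be reported; the abstract does not say which interval, if any, the study uses.

```python
from math import sqrt


def reuse_rate(labels):
    """Return the share of articles labelled as reusing existing data."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(1 for label in labels if label) / len(labels)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical per-article labels (True = the article reuses existing data).
# These are toy values, not results from the study.
labels = [True, False, True, False, False, True, False, False, True, False]

rate = reuse_rate(labels)
low, high = wilson_interval(sum(labels), len(labels))
print(f"reuse rate: {rate:.0%} (95% CI {low:.0%} to {high:.0%})")
```

With only ten toy labels the interval is very wide, which is the point of reporting it: a rate estimated at scale, as the abstract describes, would carry a much narrower interval than one estimated from a small hand-coded sample.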