Agreement Between Large Language Models, Human Reviewers, and Authors in Evaluating STROBE Checklists for Observational Studies in Rheumatology

arXiv cs.AI / 3/23/2026


Key Points

  • The study compares STROBE assessments from large language models (ChatGPT-5.2 and Gemini-3Pro), a five-person human reviewer panel, and the original manuscript authors across 17 rheumatology studies using the 22-item STROBE checklist.
  • Overall inter-rater agreement was 85.0% (AC1 = 0.826), with almost perfect agreement in the Presentation and Context domain (AC1 = 0.841) and substantial agreement in Methodological Rigor (AC1 = 0.803).
  • LLMs achieved complete agreement with human reviewers on standard formatting items but showed lower agreement on complex methodological items; for example, agreement between Gemini-3Pro and the senior reviewer on the loss-to-follow-up item was AC1 = -0.252, and agreement with the authors was only fair.
  • ChatGPT-5.2 generally demonstrated higher agreement with human reviewers than Gemini-3Pro on certain methodological items.
  • Conclusion: While LLMs show potential for basic STROBE screening, their lower agreement with human experts on complex items suggests they cannot yet replace expert judgment in evaluating observational research.

Abstract

Introduction: Evaluating compliance with the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement can be time-consuming and subjective. This study compares STROBE assessments from large language models (LLMs), a human reviewer panel, and the original manuscript authors in observational rheumatology research.

Methods: Guided by the GRRAS and DEAL Pathway B frameworks, 17 rheumatology articles were independently assessed. Evaluations used the 22-item STROBE checklist, completed by the authors, a five-person human panel (ranging from junior to senior professionals), and two LLMs (ChatGPT-5.2 and Gemini-3Pro). Items were grouped into Methodological Rigor and Presentation and Context domains. Inter-rater reliability was calculated using Gwet's Agreement Coefficient (AC1).

Results: Overall agreement across all reviewers was 85.0% (AC1=0.826). Domain stratification showed almost perfect agreement for Presentation and Context (AC1=0.841) and substantial agreement for Methodological Rigor (AC1=0.803). Although the LLMs achieved complete agreement (AC1=1.000) with all human reviewers on standard formatting elements, their agreement with human reviewers and authors declined on complex items. For example, on the loss-to-follow-up item, agreement between Gemini-3Pro and the senior reviewer was AC1=-0.252, and agreement with the authors was only fair. Additionally, ChatGPT-5.2 generally demonstrated higher agreement with human reviewers than Gemini-3Pro on specific methodological items.

Conclusion: While LLMs show potential for basic STROBE screening, their lower agreement with human experts on complex methodological items likely reflects a reliance on surface-level information. Currently, these models appear more reliable for standardizing straightforward checks than for replacing expert human judgment in evaluating observational research.
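The reliability statistic used throughout the study, Gwet's AC1, corrects observed agreement for chance agreement estimated from pooled category prevalences: AC1 = (p_a - p_e) / (1 - p_e), with p_e = (1/(Q-1)) * Σ π_q(1 - π_q). For readers who want a concrete sense of how such coefficients are computed, here is a minimal two-rater sketch in Python; the function name and the toy yes/no ratings are illustrative assumptions, not the authors' analysis code or data.

```python
from collections import Counter

def gwet_ac1(ratings_a, ratings_b):
    """Gwet's AC1 for two raters scoring the same items.

    ratings_a, ratings_b: equal-length sequences of categorical labels
    (e.g. "yes"/"no" for whether a STROBE item is adequately reported).
    """
    assert len(ratings_a) == len(ratings_b) and ratings_a, "need paired, non-empty ratings"
    n = len(ratings_a)

    # Observed agreement: fraction of items both raters labelled identically.
    p_a = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Category prevalences pooled over both raters.
    counts = Counter(ratings_a) + Counter(ratings_b)
    pi = {q: c / (2 * n) for q, c in counts.items()}

    # Chance agreement under Gwet's model: p_e = (1/(Q-1)) * sum pi_q * (1 - pi_q).
    q_count = max(len(pi), 2)  # guard against a single observed category
    p_e = sum(p * (1 - p) for p in pi.values()) / (q_count - 1)

    return (p_a - p_e) / (1 - p_e)

# Toy usage: one reviewer and one LLM rating ten manuscripts on a single STROBE item.
reviewer = ["yes", "yes", "no", "yes", "no", "yes", "yes", "no", "yes", "yes"]
llm      = ["yes", "no",  "no", "yes", "no", "yes", "yes", "yes", "yes", "yes"]
print(f"AC1 = {gwet_ac1(reviewer, llm):.3f}")
```

One reason AC1 is a common choice for reporting checklists is that it remains stable when one category dominates (for instance, when most items are rated "reported"), a situation where Cohen's kappa can drop sharply despite high raw agreement.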