Agreement Between Large Language Models, Human Reviewers, and Authors in Evaluating STROBE Checklists for Observational Studies in Rheumatology

arXiv cs.AI / 3/23/2026


Key Points

  • The study compares STROBE assessments from large language models (ChatGPT-5.2 and Gemini-3Pro), a five-person human reviewer panel, and the original manuscript authors across 17 rheumatology studies using the 22-item STROBE checklist.
  • Overall inter-rater agreement was 85.0% (AC1 = 0.826), with almost perfect agreement in the Presentation and Context domain (AC1 = 0.841) and substantial agreement in Methodological Rigor (AC1 = 0.803).
  • LLMs achieved complete agreement with human reviewers on standard formatting items but showed lower agreement on complex methodological items; for example, agreement between Gemini-3Pro and the senior reviewer on the loss-to-follow-up item was AC1 = -0.252, and agreement with the authors was only fair.
  • ChatGPT-5.2 generally demonstrated higher agreement with human reviewers than Gemini-3Pro on certain methodological items.
  • Conclusion: While LLMs show potential for basic STROBE screening, their lower agreement with human experts on complex items suggests they cannot yet replace expert judgment in evaluating observational research.

Abstract

Introduction: Evaluating compliance with the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement can be time-consuming and subjective. This study compares STROBE assessments from large language models (LLMs), a human reviewer panel, and the original manuscript authors in observational rheumatology research.

Methods: Guided by the GRRAS and DEAL Pathway B frameworks, 17 rheumatology articles were independently assessed. Evaluations used the 22-item STROBE checklist, completed by the authors, a five-person human panel (ranging from junior to senior professionals), and two LLMs (ChatGPT-5.2 and Gemini-3Pro). Items were grouped into Methodological Rigor and Presentation and Context domains. Inter-rater reliability was calculated using Gwet's Agreement Coefficient (AC1).

Results: Overall agreement across all reviewers was 85.0% (AC1=0.826). Domain stratification showed almost perfect agreement for Presentation and Context (AC1=0.841) and substantial agreement for Methodological Rigor (AC1=0.803). Although the LLMs achieved complete agreement (AC1=1.000) with all human reviewers on standard formatting elements, their agreement with human reviewers and authors declined on complex items. For example, on the loss-to-follow-up item, agreement between Gemini-3Pro and the senior reviewer was AC1=-0.252, and agreement with the authors was only fair. Additionally, ChatGPT-5.2 generally demonstrated higher agreement with human reviewers than Gemini-3Pro on specific methodological items.

Conclusion: While LLMs show potential for basic STROBE screening, their lower agreement with human experts on complex methodological items likely reflects a reliance on surface-level information. Currently, these models appear more reliable for standardizing straightforward checks than for replacing expert human judgment in evaluating observational research.
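The reliability statistic used throughout the study, Gwet's AC1, corrects observed agreement for chance agreement estimated from pooled category prevalences: AC1 = (p_a - p_e) / (1 - p_e), with p_e = (1/(Q-1)) * Σ π_q(1 - π_q). For readers who want a concrete sense of how such coefficients are computed, here is a minimal two-rater sketch in Python; the function name and the toy yes/no ratings are illustrative assumptions, not the authors' analysis code or data.

```python
from collections import Counter

def gwet_ac1(ratings_a, ratings_b):
    """Gwet's AC1 for two raters scoring the same items.

    ratings_a, ratings_b: equal-length sequences of categorical labels
    (e.g. "yes"/"no" for whether a STROBE item is adequately reported).
    """
    assert len(ratings_a) == len(ratings_b) and ratings_a, "need paired, non-empty ratings"
    n = len(ratings_a)

    # Observed agreement: fraction of items both raters labelled identically.
    p_a = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Category prevalences pooled over both raters.
    counts = Counter(ratings_a) + Counter(ratings_b)
    pi = {q: c / (2 * n) for q, c in counts.items()}

    # Chance agreement under Gwet's model: p_e = (1/(Q-1)) * sum pi_q * (1 - pi_q).
    q_count = max(len(pi), 2)  # guard against a single observed category
    p_e = sum(p * (1 - p) for p in pi.values()) / (q_count - 1)

    return (p_a - p_e) / (1 - p_e)

# Toy usage: one reviewer and one LLM rating ten manuscripts on a single STROBE item.
reviewer = ["yes", "yes", "no", "yes", "no", "yes", "yes", "no", "yes", "yes"]
llm      = ["yes", "no",  "no", "yes", "no", "yes", "yes", "yes", "yes", "yes"]
print(f"AC1 = {gwet_ac1(reviewer, llm):.3f}")
```

One reason AC1 is a common choice for reporting checklists is that it remains stable when one category dominates (for instance, when most items are rated "reported"), a situation where Cohen's kappa can drop sharply despite high raw agreement.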