Agreement Between Large Language Models, Human Reviewers, and Authors in Evaluating STROBE Checklists for Observational Studies in Rheumatology
arXiv cs.AI / 3/23/2026
Key Points
- The study compares STROBE assessments from large language models (ChatGPT-5.2 and Gemini-3Pro), a five-person human reviewer panel, and the original manuscript authors across 17 rheumatology studies using the 22-item STROBE checklist.
- Overall inter-rater agreement was 85.0% (AC1 = 0.826), with almost perfect agreement in the Presentation and Context domain (AC1 = 0.841) and substantial agreement in Methodological Rigor (AC1 = 0.803).
- LLMs achieved complete agreement with human reviewers on standard formatting items but lower agreement on complex methodological items—for example, agreement between Gemini-3Pro and the senior reviewer on the loss-to-follow-up item was AC1 = -0.252, and agreement with the original authors was only fair.
- ChatGPT-5.2 showed higher agreement with human reviewers than Gemini-3Pro on several methodological items.
- Conclusion: While LLMs show potential for basic STROBE screening, their lower agreement with human experts on complex items suggests they cannot yet replace expert judgment in evaluating observational research.
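The agreement figures above are reported as AC1, which presumably refers to Gwet's first-order agreement coefficient, a chance-corrected statistic that is more stable than Cohen's kappa when category prevalence is skewed (as with checklist items that are almost always rated "yes"). A minimal sketch of AC1 for two raters, assuming the standard two-rater formulation (the example data are illustrative, not from the study):

```python
def gwet_ac1(ratings_a, ratings_b):
    """Gwet's AC1 for two raters with categorical ratings.

    AC1 = (pa - pe) / (1 - pe), where pa is the observed agreement
    and pe = sum_q pi_q * (1 - pi_q) / (Q - 1) is Gwet's
    chance-agreement term over the Q observed categories.
    """
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    categories = sorted(set(ratings_a) | set(ratings_b))
    q = len(categories)
    if q < 2:
        return 1.0  # only one category observed: agreement is trivially perfect

    # Observed agreement: fraction of items on which both raters match.
    pa = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement from the pooled category proportions of both raters.
    pe = 0.0
    for c in categories:
        pi = (ratings_a.count(c) + ratings_b.count(c)) / (2 * n)
        pe += pi * (1 - pi) / (q - 1)

    return (pa - pe) / (1 - pe)

# Illustrative checklist ratings ("y" = item adequately reported):
model_ratings    = ["y", "y", "n", "y"]
reviewer_ratings = ["y", "n", "n", "y"]
print(round(gwet_ac1(model_ratings, reviewer_ratings), 4))  # 0.5294
```

Values near 1 indicate agreement well beyond chance; negative values (like the -0.252 reported for the loss-to-follow-up item) indicate agreement worse than chance.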