Fine-Tuning A Large Language Model for Systematic Review Screening
arXiv cs.CL · March 27, 2026
Key Points
- The study investigates why prior LLM approaches to systematic review screening have produced inconsistent results, arguing that prompting alone lacks sufficient context for strong performance.
- Researchers fine-tuned a small 1.2B-parameter open-weight LLM specifically for title and abstract screening using human ratings from a dataset of 8,500+ records.
- The fine-tuned model substantially outperformed the base model, achieving an 80.79% improvement in weighted F1 score.
- On the full dataset of 8,277 studies, the fine-tuned model matched human coders with 86.40% agreement, including a 91.18% true positive rate and 86.38% true negative rate.
- The authors report stable behavior across repeated inference runs with perfect agreement, concluding that fine-tuning may be promising for large-scale systematic review workflows.
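The study's headline numbers (agreement, true positive rate, true negative rate) are standard confusion-matrix quantities comparing model screening decisions against human coder labels. The paper does not publish its evaluation code, so the sketch below is a minimal, hypothetical illustration of how those three metrics are computed from binary include/exclude labels; the function name and toy data are my own.

```python
def screening_metrics(human, model):
    """Compare model screening decisions against human coder labels
    (1 = include, 0 = exclude) and return agreement, TPR, and TNR.

    Illustrative sketch only -- not the authors' actual evaluation code.
    """
    tp = sum(1 for h, m in zip(human, model) if h == 1 and m == 1)
    tn = sum(1 for h, m in zip(human, model) if h == 0 and m == 0)
    fp = sum(1 for h, m in zip(human, model) if h == 0 and m == 1)
    fn = sum(1 for h, m in zip(human, model) if h == 1 and m == 0)

    agreement = (tp + tn) / len(human)            # overall % agreement
    tpr = tp / (tp + fn) if tp + fn else 0.0      # sensitivity: included records caught
    tnr = tn / (tn + fp) if tn + fp else 0.0      # specificity: excluded records caught
    return agreement, tpr, tnr


# Toy example with 8 screened records
human = [1, 1, 0, 0, 0, 1, 0, 0]
model = [1, 0, 0, 0, 1, 1, 0, 0]
agreement, tpr, tnr = screening_metrics(human, model)
```

In systematic review screening, TPR matters most: a missed relevant study (false negative) cannot be recovered downstream, whereas false positives only cost extra full-text review time.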