A Semi-Automated Annotation Workflow for Paediatric Histopathology Reports Using Small Language Models

arXiv cs.CL / 4/7/2026


Key Points

  • The study tackles the challenge of extracting structured clinical data from unstructured paediatric histopathology EPR text without relying on cloud LLM services that raise privacy concerns.
  • It proposes a resource-efficient semi-automated annotation workflow that uses small language models framed as clinician-guided question answering, with few-shot examples and domain-specific entity guidelines.
  • Using paediatric renal biopsy reports as a constrained, well-characterized domain, the authors manually annotated 400 reports as a gold standard from a dataset of 2,111 at Great Ormond Street Hospital.
  • Across five instruction-tuned small language models, Gemma 2 2B achieved the best accuracy (84.3%), outperforming off-the-shelf NLP baselines (e.g., spaCy at 74.3%, with biomedical QA baselines such as BioBERT-SQuAD, RoBERTa-SQuAD, and GLiNER scoring between 59.7% and 62.3%).
  • Clinician-written entity guidelines and few-shot prompting improved extraction accuracy (guidelines: +7–19%; few-shot: +6–38%), enabling effective CPU-only deployment with minimal clinician time, and the code is released publicly.
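The workflow frames extraction as question answering grounded by clinician-written entity guidelines and few-shot examples. A minimal sketch of how such a prompt might be assembled is shown below; the entity name, guideline wording, and examples are illustrative assumptions, not taken from the study.

```python
def build_qa_prompt(report_text, entity, guideline, few_shot_examples):
    """Assemble a QA-style prompt for a small language model.

    The guideline grounds the question; few-shot examples demonstrate
    the expected answer format before the target report is presented.
    """
    lines = [
        "You are extracting structured fields from a renal biopsy report.",
        f"Guideline for '{entity}': {guideline}",
        "",
    ]
    for ex in few_shot_examples:
        lines.append(f"Report: {ex['report']}")
        lines.append(f"Question: What is the {entity}?")
        lines.append(f"Answer: {ex['answer']}")
        lines.append("")
    # Target report goes last, leaving the answer for the model to complete.
    lines.append(f"Report: {report_text}")
    lines.append(f"Question: What is the {entity}?")
    lines.append("Answer:")
    return "\n".join(lines)


prompt = build_qa_prompt(
    report_text="Core biopsy shows 12 glomeruli, 2 globally sclerosed.",
    entity="glomeruli count",
    guideline="Report the total number of glomeruli sampled as an integer.",
    few_shot_examples=[
        {"report": "Biopsy contains 8 glomeruli.", "answer": "8"},
    ],
)
print(prompt)
```

The resulting string would then be passed to a locally hosted instruction-tuned SLM; the same template can be reused per entity, swapping in the guideline text written by clinicians during the review meetings.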

Abstract

Electronic Patient Record (EPR) systems contain valuable clinical information, but much of it is trapped in unstructured text, limiting its use for research and decision-making. Large language models can extract such information but require substantial computational resources to run locally, and sending sensitive clinical data to cloud-based services, even when deidentified, raises significant patient privacy concerns. In this study, we develop a resource-efficient semi-automated annotation workflow using small language models (SLMs) to extract structured information from unstructured EPR data, focusing on paediatric histopathology reports. As a proof-of-concept, we apply the workflow to paediatric renal biopsy reports, a domain chosen for its constrained diagnostic scope and well-defined underlying biology. We develop the workflow iteratively with clinical oversight across three meetings, manually annotating 400 reports from a dataset of 2,111 at Great Ormond Street Hospital as a gold standard, while developing an automated information extraction approach using SLMs. We frame extraction as a Question-Answering task grounded by clinician-guided entity guidelines and few-shot examples, evaluating five instruction-tuned SLMs with a disagreement modelling framework to prioritise reports for clinical review. Gemma 2 2B achieves the highest accuracy at 84.3%, outperforming off-the-shelf models including spaCy (74.3%), BioBERT-SQuAD (62.3%), RoBERTa-SQuAD (59.7%), and GLiNER (60.2%). Entity guidelines improved performance by 7–19% over the zero-shot baseline, and few-shot examples by 6–38%, though their benefits do not compound when combined. These results demonstrate that SLMs can extract structured information from specialised clinical domains on CPU-only infrastructure with minimal clinician involvement. Our code is available at https://github.com/gosh-dre/nlp_renal_biopsy.
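The abstract mentions a disagreement modelling framework that prioritises reports for clinical review. One simple way such a framework could work is to run several SLMs over each report and rank reports by how often the models' extracted answers diverge; the sketch below illustrates that idea with hypothetical model and entity names, and is not the paper's actual implementation.

```python
def disagreement_score(model_answers):
    """Fraction of entities on which the models do not all agree.

    model_answers: dict mapping model name -> dict of entity -> extracted answer.
    """
    entities = next(iter(model_answers.values())).keys()
    disagreements = sum(
        1
        for entity in entities
        if len({answers[entity] for answers in model_answers.values()}) > 1
    )
    return disagreements / len(entities)


def prioritise(reports):
    """Order report IDs by disagreement, highest first, for clinician review."""
    return sorted(reports, key=lambda r: disagreement_score(reports[r]), reverse=True)


# Toy example: two models extracting one entity from two reports.
reports = {
    "report_1": {
        "model_a": {"diagnosis": "IgA nephropathy"},
        "model_b": {"diagnosis": "IgA nephropathy"},
    },
    "report_2": {
        "model_a": {"diagnosis": "IgA nephropathy"},
        "model_b": {"diagnosis": "FSGS"},
    },
}
print(prioritise(reports))  # report_2 ranks first: the models disagree on it
```

Under this scheme, clinician time is spent where automated extraction is least trustworthy, which is consistent with the paper's goal of minimising manual review effort.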